Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode/UTF-8 support to UrlEncode and UrlDecode #175

Open
parkone opened this issue Nov 8, 2015 · 10 comments
Open

Add Unicode/UTF-8 support to UrlEncode and UrlDecode #175

parkone opened this issue Nov 8, 2015 · 10 comments

Comments

@parkone
Copy link

parkone commented Nov 8, 2015

Function URLEncode in the WebHelpers module do not convert correctly Unicode characters!
When sending json with non-Unicode characters it will broken.
I suggest to use this function instead

'iconv+urlencode /based on ADODB.Stream (include a reference to a recent version of the "Microsoft ActiveX Data Objects" library in your project)
'http://stackoverflow.com/questions/218181/how-can-i-url-encode-a-string-in-excel-vba
'https://msdn.microsoft.com/en-us/library/ms681424%28v=vs.85%29.aspx
Public Function URLEncode(StringVal As Variant, Optional SpaceAsPlus As Boolean = False) As String

  Dim bytes() As Byte, b As Byte, i As Integer, space As String

  If SpaceAsPlus Then space = "+" Else space = "%20"

  If Len(StringVal) > 0 Then
    With New ADODB.stream
      .mode = adModeReadWrite
      .Type = adTypeText
      .Charset = "UTF-8"
      .Open
      .WriteText StringVal
      .Position = 0
      .Type = adTypeBinary
      .Position = 3 ' skip BOM
      bytes = .Read
    End With

    ReDim Result(UBound(bytes)) As String

    For i = UBound(bytes) To 0 Step -1
      b = bytes(i)
      Select Case b
        Case 97 To 122, 65 To 90, 48 To 57, 45, 46, 95, 126
          Result(i) = Chr(b)
        Case 32
          Result(i) = space
        Case 0 To 15
          Result(i) = "%0" & Hex(b)
        Case Else
          Result(i) = "%" & Hex(b)
      End Select
    Next i

    URLEncode = Join(Result, "")
  End If
End Function
@timhall
Copy link
Member

timhall commented Nov 13, 2015

@parkone Thanks for raising this issue. This is a known issue and has I'm researching approaches of resolving this that maintain Mac support (unfortunately, ADODB is not available on Mac). I'll keep you posted.

@bdr99
Copy link

bdr99 commented Mar 3, 2016

Is this issue fixed by #205?

@timhall
Copy link
Member

timhall commented Mar 7, 2016

@bdr99 it should be, but I'll add a test case.

@Sophist-UK
Copy link
Contributor

I have submitted another PR #213 to enhance UrlEncode further and to provide similar enhancements and fixes to UrlDecode also. This does NOT AFAIK provide full Unicode support - but it does handle ASCII characters above 128.

@timhall timhall changed the title URLEncode with unicode support Add Unicode/UTF-8 support to UrlEncode and UrlDecode Apr 3, 2016
@timhall
Copy link
Member

timhall commented Apr 3, 2016

Found this reference implementation for Java: https://www.w3.org/International/URLUTF8Encoder.java

All non-ASCII characters are encoded in two steps: first

  •    to a sequence of 2 or 3 bytes, using the UTF-8 algorithm;
    
  •    secondly each of these bytes is encoded as "%xx".
    
// relevant lines
} else if (ch <= 0x007f) {      // other ASCII
       sbuf.append(hex[ch]);
} else if (ch <= 0x07FF) {      // non-ASCII <= 0x7FF
       sbuf.append(hex[0xc0 | (ch >> 6)]);
       sbuf.append(hex[0x80 | (ch & 0x3F)]);
} else {                    // 0x7FF < ch <= 0xFFFF
       sbuf.append(hex[0xe0 | (ch >> 12)]);
       sbuf.append(hex[0x80 | ((ch >> 6) & 0x3F)]);
       sbuf.append(hex[0x80 | (ch & 0x3F)]);
}

@ghost
Copy link

ghost commented Jul 23, 2016

I don't know if you'll find this useful for the UTF8 decode, but I've been trying to decode URLs myself and after hours of searching I stumbled upon this: http://stackoverflow.com/questions/23810324/vba-convert-string-to-unicode
You'd need to convert the consecutive encoded hex values to a byte array, and then you can get their proper value with the function that hamish provided in his answer.
To convert an hex value to byte, you use val("&Hnn&"), where nn is the hex value without the percentage symbol (for instance, if I want to get the byte value of %84 then val("&H84&") will return 132).

Also, perhaps this bit here helps you convert text to an UTF8 byte array, perhaps if you combine it with the second part of parkone's answer you can achieve the encoding without the need of ADODB: http://www.di-mgt.com.au/howto-convert-vba-unicode-to-utf8.html

@justinmassiot
Copy link

I still have issues with non-ASCII characters with 4.1.1 . In Excel 2016 (Windows 7) when I replace WebHelpers.UrlEncode from WebRequest.cls by WorksheetFunction.EncodeURL - source - everything works like a charm.
I hope you could find a solution that would be portable (Win/Mac + compatible with older versions) and suitable for all character sets.

@grv87
Copy link

grv87 commented Feb 3, 2019

@justinmassiot, WorksheetFunction.EncodeURL works for Office 2013+ only, so it's not an option.

VBA-Web/src/WebHelpers.bas

Lines 997 to 1007 in f400fbc

' TODO For non-ASCII characters,
'
' FormUrlEncoded:
'
' Replace the character by a string consisting of a U+0026 AMPERSAND character (&), a "#" (U+0023) character,
' one or more ASCII digits representing the Unicode code point of the character in base ten, and finally a ";" (U+003B) character.
'
' Else:
'
' Encode to sequence of 2 or 3 bytes in UTF-8, then percent encode
' Reference Implementation: https://www.w3.org/International/URLUTF8Encoder.java

The first approach is non-standard and not supported by some (most?) servers, see Wikipedia.

I've implemented the second approach (not pushed to GitHub yet), and it works like a charm, except that I haven't tested it with surrogates. It may require additional code.
But I'm not sure this would be the best solution in terms of performance.
Solution proposed by @Seeker320 at http://www.di-mgt.com.au/howto-convert-vba-unicode-to-utf8.html and at #252 uses WinAPI WideCharToMultiByte function and may work faster.
But I don't know an alternative external function for Mac.
From Google it looks like mbsnrtowcs_l could do the job. However, I'm not sure how to pass UTF-16 locale to it.
Another option is to use iconv but this adds additional dependency.
I don't have a Mac to test. Is a developer with Mac experience here to test and decide?

Also, there are other places where UTF-8 support is required. The solution should be universal.

@timhall, what do you think?
Should we go the hard way (with probably efficient external functions) or the simple way (pure VBA solution)?

@grv87
Copy link

grv87 commented Feb 3, 2019

Also, I wonder whether VBA.StrConv(..., vbFromUnicode) calls in the code should actually use UTF-8 too.

@gnefnuy
Copy link

gnefnuy commented Nov 25, 2021

So, is it solved?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants