Problem for creating RTF file for Japanese Language (Origin: bugzilla #437346) #2483

doxygen · 2018-07-01T22:56:26Z

status RESOLVED severity normal in component doxywizard for ---
Reported in version 1.5.6 on platform Other
Assigned to: Dimitri van Heesch

Original attachment names and IDs:

doxygen_test.zip (ID 88188)
refman.zip (ID 193349)
rtfgen.mod.zip (ID 193350)

On 2007-05-10 04:21:19 +0000, T.M wrote:

Please describe the problem:
I tried to create an RTF file with doxygen, but it fails to
create it when the Japanese language is specified.

Steps to reproduce:
(Using doxywizard)

Choose "Japanese" for OUTPUT_LANGUAGE in the Project tab.

Check "GENERATE_RTF" in the RTF tab.

Use the simple C source code file for input
(it doesn't include any Japanese characters)
--- source file ----
void main( void )
{
printf( "Hello! World" );
}

After specifying "Working directory", push the "Start" button.

Actual results:
Doxygen produces the error below and the RTF file appears to be incorrect.

Error: RTF integrity test failed at line 117 of D:/doxygen/rtf/refman.rtf due to a bracket mismatch.
Please try to create a small code example that produces this error
and send that to dimitri@stack.nl.
*** Doxygen has finished

Expected results:
Creating proper RTF file.

Does this happen every time?
Yes.

Other information:
When I look at the 'wrong' RTF file, I find a '}' character in the title string.
I think it may cause this problem. ( like "83}" )
== in RTF file ===================
{\title {\comment TEST \'83\'8A\'83t\'83@\'83\'8C\'83\'93\'83X\'83}\'83j\'83\'85\'83A\'83\'8B {\s17\sa60\sb30\widctlpar\qj \fs22\cgrid

(This line is for "Reference Manual" in English)
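
For illustration, the effect of the stray '}' on the group nesting can be reproduced with a small stand-alone check. This is only a sketch: `bracesBalanced` is a hypothetical helper, not doxygen's actual integrity test, and it assumes that a backslash escapes exactly the one character that follows it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when every unescaped '{' has a
// matching '}'. A backslash is assumed to escape the next character
// (as in \{, \}, \\ and the start of \'hh).
bool bracesBalanced(const std::string &rtf)
{
  int depth = 0;
  for (std::size_t i = 0; i < rtf.size(); ++i)
  {
    char c = rtf[i];
    if (c == '\\') { ++i; continue; }               // skip the escaped character
    if (c == '{') ++depth;
    else if (c == '}' && --depth < 0) return false; // closed once too often
  }
  return depth == 0;
}
```

With a shortened version of the title line, `{\title {\comment \'83}\'83j}}` is reported as unbalanced, because the raw '}' closes the comment group early, while `{\title {\comment \'83\'7D\'83j}}` is balanced.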

If I choose "Japanese-en" for OUTPUT_LANGUAGE, the RTF file is
created successfully.

my OS : Windows XP Professional SP2 (Japanese)

On 2007-05-13 11:33:45 +0000, Dimitri van Heesch wrote:

Did you set INPUT_ENCODING to the correct value? If not, the input is assumed to be encoded as UTF-8. You probably need to set it to EUC-JP, SHIFT_JIS, or EUC-JISX0213 in your case. Does this solve your problem?

On 2007-05-14 02:59:13 +0000, T.M wrote:

(In reply to comment # 1)
I tried several combinations for INPUT_ENCODING and DOXYFILE_ENCODING,
but they also failed. (same error was produced)

INPUT_ENCODING  DOXYFILE_ENCODING  result

UTF-8           UTF-8              fail
SHIFT_JIS       UTF-8              fail
EUC-JP          UTF-8              fail
SHIFT_JIS       SHIFT_JIS          fail
EUC-JP          EUC-JP             fail
UTF-8           SHIFT_JIS          fail

On 2007-05-15 01:25:13 +0000, T.M wrote:

Created attachment 88188
Doxygen configuration file and source code

This file includes the configuration file, the source file,
and the RTF file generated by doxygen.
It may help to reproduce this problem.

On 2008-09-09 00:27:26 +0000, T.M wrote:

(In reply to comment # 0)
This problem occurs when a multibyte character has a special
character, such as '}' (0x7D), '{' (0x7B) or '\' (0x5C),
as its second byte. For example, the multibyte code 0x837D is
converted to "\'83}" by the current software, and the character
'}' then breaks the RTF file. I think the output
should be "\'83\}" or "\'83\'7D".

If I change one of the functions in the source file 'rtfgen.cpp'
to write the second byte of a multibyte character in hex format, it seems
to work well. I confirmed this only for Japanese,
so I'm not sure whether this modification causes problems
for other languages.

=== file: rtfgen.cpp ==========================
void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  uint mb_flag = 0; // <- Add
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mb_flag==1) // <- Add (mb_flag==1)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mb_flag = 1 - mb_flag; // <- Add
    }
    else
    {
      enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}

I hope this information is helpful to resolve the problem.

On 2008-10-12 11:10:44 +0000, Dimitri van Heesch wrote:

Thanks for the feedback, I plan to change the postProcess function like this:

void RTFGenerator::postProcess(QByteArray &a)
{
  QByteArray enc(a.size()*4); // worst case
  int off=0;
  uint i;
  bool mbFlag=FALSE;
  for (i=0;i<a.size();i++)
  {
    unsigned char c = (unsigned char)a.at(i);
    if (c>0x80 || mbFlag)
    {
      char s[10];
      sprintf(s,"\\'%X",c);
      qstrcpy(enc.data()+off,s);
      off+=qstrlen(s);
      mbFlag=c>0x80;
    }
    else
    {
      enc.at(off++)=c;
    }
  }
  enc.resize(off);
  a = enc;
}

Do you see issues with this? The idea is escaping one character <0x80 after a sequence of one or more >0x80 characters.
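
To make the behaviour of this idea easier to check by hand, here is a minimal stand-alone sketch of the same loop. It is an illustration only: `escapeAfterHighRun` is a hypothetical name, `std::string` stands in for `QByteArray`, and `%02X` is used instead of `%X` so that every escape has two hex digits.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-alone variant of the loop above: escape every byte
// greater than 0x80, plus the single byte that follows a run of such bytes.
std::string escapeAfterHighRun(const std::string &in)
{
  std::string out;
  bool mbFlag = false;
  for (unsigned char c : in)
  {
    if (c > 0x80 || mbFlag)
    {
      char s[8];
      std::snprintf(s, sizeof(s), "\\'%02X", static_cast<unsigned>(c));
      out += s;
      mbFlag = c > 0x80; // stay in "escape next byte" mode after a high byte
    }
    else
    {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

For the Shift-JIS bytes 0x83 0x7D this yields `\'83\'7D`, so the '}' no longer appears raw. For 0x90 0xAC followed by the ASCII letter 'A' it yields `\'90\'AC\'41`: the byte after a run of high bytes is escaped as well, whether or not it belongs to a multibyte character.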

On 2008-12-27 14:12:42 +0000, Dimitri van Heesch wrote:

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.5.8. Please verify if this is indeed the case and reopen the
bug if you think it is not fixed (include any additional information that you
think can be relevant).

On 2011-08-06 15:47:28 +0000, hiroa wrote:

Created attachment 193349
Japanese RTF source set.

OS is Windows.
INPUT_ENCODING is UTF-8.
OUTPUT_LANGUAGE is Japanese.

Wrong output:
NG: 構'90ｬ索引, OK: 構成索引 (defined in translator_jp.h)
NG: 機能'82P, OK: 機能１ (defined in enum.h line 10)
NG??: \, OK??: \\ (defined in enum.h line 9)

On 2011-08-06 15:59:41 +0000, hiroa wrote:

Created attachment 193350
sample RTF multibyte patch.

I want to use the Japanese RTF output(cp932).

The multi-byte encoding in the RTF generator has been broken from version 1.5.8 to 1.7.4.

In version 1.5.8, if the second byte is greater than 0x80, the third byte is also escaped unintentionally.
Since version 1.6.3, a second byte of 0x5C is not escaped, so a bare '\' appears and the result is rendered incorrectly.

I made a sample patch.

Code Pages Supported by Windows
http://msdn.microsoft.com/ja-jp/goglobal/bb964654.aspx

On 2011-08-06 16:26:59 +0000, Dimitri van Heesch wrote:

Hi Hiroa,

Thanks for your patch. I plan to introduce a more generic solution for RTF encoding, using the following change to encodeForOutput:

// note: function is not reentrant!
static void encodeForOutput(FTextStream &t,const QCString &s)
{
  QCString encoding;
  bool converted=FALSE;
  int l = s.length();
  static QByteArray enc;
  if (l*4>(int)enc.size()) enc.resize(l*4); // worst case
  encoding.sprintf("CP%s",theTranslator->trRTFansicp().data());
  if (!encoding.isEmpty())
  {
    // convert from UTF-8 back to the output encoding
    void *cd = portable_iconv_open(encoding,"UTF-8");
    if (cd!=(void *)(-1))
    {
      size_t iLeft=l;
      size_t oLeft=enc.size();
      const char *inputPtr = s.data();
      char *outputPtr = enc.data();
      if (!portable_iconv(cd, &inputPtr, &iLeft, &outputPtr, &oLeft))
      {
        enc.resize(enc.size()-oLeft);
        converted=TRUE;
      }
      portable_iconv_close(cd);
    }
  }
  if (!converted) // if we did not convert anything, copy as is.
  {
    memcpy(enc.data(),s.data(),l);
    enc.resize(l);
  }
  uint i;
  for (i=0;i<enc.size();i++)
  {
    uchar c = (uchar)enc.at(i);
    if (c>=0x80)
    {
      char esc[10];
      sprintf(esc,"\\'%X",c);
      t << esc;
      // write 2nd byte
      i++;
      if (i<enc.size())
      {
        uchar c2 = (uchar)enc.at(i);
        sprintf(esc,"\\'%X",c2);
        t << esc;
      }

      if (((uchar)c&0xE0)==0xE0)
      {
        // write 3rd byte
        i++;
        if (i<enc.size())
        {
          uchar c3 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c3);
          t << esc;
        }
      }
      if (((uchar)c&0xF0)==0xF0)
      {
        // write 4th byte
        i++;
        if (i<enc.size())
        {
          uchar c4 = (uchar)enc.at(i);
          sprintf(esc,"\\'%X",c4);
          t << esc;
        }
      }
    }
    else
    {
      t << (char)c;
    }
  }
}

Can you check if this also works for you?

On 2011-08-07 03:09:39 +0000, hiroa wrote:

Hi Dimitri,

I read the source code, but I did not understand which code page you assumed.
It handles Japanese (and perhaps Chinese and Korean) incorrectly.
I think it is better not to deviate from the original patch unless it contains a mistake.

The loop for (i=0;i<enc.size();i++) needs to process the bytes according to the output code page of the RTF file.

Most character sets that predate Unicode are either single-byte character sets (SBCS, single-byte characters only) or double-byte character sets (DBCS, a mix of single-byte and double-byte characters).

cp932 (DBCS: Japanese Shift-JIS)
cp936 (DBCS: Simplified Chinese GBK)
cp949 (DBCS: Korean)
cp950 (DBCS: Traditional Chinese Big5)
cp1252,1251 etc...(SBCS: LatinI, Cyrillic...)

Because these are similar but not identical, there is probably no generic solution.

Concretely, 0xB1 0x5C is one character in GBK, but two characters in Shift-JIS.
cp936 :盶(U+76F6)
cp932 :ｱ(U+FF71) \(U+005C) (Sorry, 0x5C is displayed as a yen sign in Shift-JIS; it is a backslash in ASCII.)

The "\'B1\'5C" encoding is correct for GBK,
but in Shift-JIS the 0x5C is a separate character and must not be encoded as a second byte.

In addition, 0x91 0x5C is one character in GBK and Shift-JIS, but two characters in Latin-I.

cp936 :慭(U+616D)
cp932 :曾(U+66FE)
cp1252:‘(U+2018) \(U+005C)

That is why the code page check could be made faster, but the part of the original patch that switches processing per language should not be changed much.
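
The point about code-page-dependent lead bytes can be sketched as follows. This is not the attached patch; `isLeadByte` and `escapeForCodePage` are hypothetical names, the lead-byte ranges are the commonly documented ones for these Windows code pages and should be checked against the code page tables, and bytes below 0x80 outside a double-byte character are copied unchanged on the assumption that RTF special characters were already handled earlier.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Assumed lead-byte ranges per code page (illustration only).
bool isLeadByte(int codePage, unsigned char c)
{
  switch (codePage)
  {
    case 932: // Shift-JIS: 0xA1..0xDF are single-byte half-width katakana
      return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    case 936: case 949: case 950: // GBK, Korean, Big5
      return c >= 0x81 && c <= 0xFE;
    default:  // SBCS such as cp1252: no lead bytes
      return false;
  }
}

// Escape high bytes as \'hh; after a lead byte, escape the trail byte too,
// whatever its value (so 0x5C, 0x7B and 0x7D never appear raw).
std::string escapeForCodePage(const std::string &in, int codePage)
{
  std::string out;
  char s[8];
  for (std::size_t i = 0; i < in.size(); ++i)
  {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80) { out += in[i]; continue; }
    std::snprintf(s, sizeof(s), "\\'%02X", static_cast<unsigned>(c));
    out += s;
    if (isLeadByte(codePage, c) && i + 1 < in.size())
    {
      unsigned char c2 = static_cast<unsigned char>(in[++i]);
      std::snprintf(s, sizeof(s), "\\'%02X", static_cast<unsigned>(c2));
      out += s;
    }
  }
  return out;
}
```

Under these assumptions, the bytes 0xB1 0x5C become `\'B1\'5C` for cp936 but `\'B1` followed by an untouched backslash for cp932, and 0x91 0x5C is escaped as a pair for cp932 and cp936 but not for cp1252, which matches the examples above.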

On 2011-08-07 07:34:38 +0000, Dimitri van Heesch wrote:

Hi Hiroa,

Thanks for your explanation. I see now why my proposed patch is wrong. I will use your patch instead. Thanks a lot for your help.

On 2011-08-07 07:55:48 +0000, Dimitri van Heesch wrote:

*** Bug 643068 has been marked as a duplicate of this bug. ***

On 2011-08-07 07:58:37 +0000, Dimitri van Heesch wrote:

*** Bug 166535 has been marked as a duplicate of this bug. ***

On 2011-08-14 14:04:44 +0000, Dimitri van Heesch wrote:

This bug was previously marked ASSIGNED, which means it should be fixed in
doxygen version 1.7.5. Please verify if this is indeed the case. Reopen the
bug if you think it is not fixed and please include any additional information
that you think can be relevant.

doxygen closed this as completed Jul 1, 2018

doxygen added the "doxywizard" label (bug is specific for the wizard) Jul 19, 2018
