Analysis of image data in the HN format #43

Closed
JeziL opened this issue Mar 9, 2020 · 89 comments

Comments

@JeziL
Member

JeziL commented Mar 9, 2020

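It appears that image data in the HN format consists of the DIB header from the BMP format, a palette, and compressed pixel data.

DIB header format:
(image: DIB Header format)
Palette entry format: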

typedef struct tagRGBQUAD {
    BYTE rgbBlue;
    BYTE rgbGreen;
    BYTE rgbRed;
    BYTE rgbReserved;
}   RGBQUAD;

Sample:
(image: sample)

Comparing against the DIB header layout, we can see that CImage::DecodeJbig(int a1, int a2, int a3) in the decompiled source of libreaderex.so parses the DIB header and passes the relevant parameters to JBigCodec::Decode:

int __fastcall CImage::DecodeJbig(int a1, int a2, int a3)
{
  int v3; // r6@1
  int v4; // r7@1
  int v5; // r10@1
  int v6; // r4@1
  int v7; // r8@1
  int v8; // r9@1
  int v9; // r5@1
  int v10; // r0@1
  char v12; // [sp+14h] [bp-8054h]@1

  v3 = a3;
  // Width in pixels
  v4 = (*(_WORD *)(a1 + 6) << 16) | *(_WORD *)(a1 + 4);
  // Height in pixels
  v5 = (*(_WORD *)(a1 + 10) << 16) | *(_WORD *)(a1 + 8);
  v6 = a1;
  v7 = a2;
  // Bytes per row of pixels (rounded up to a multiple of 4)
  v8 = 4 * ((v4 * *(_WORD *)(a1 + 14) + 31) / 32);
  v9 = (int)operator new(0x128u);
  CImage::CImage(v9, v6);
  // Address where the DIB header and palette end
  v10 = FindDIBBits(v6);
  JBigCodec::Decode((int)&v12, v10, v6 - v10 + v7, v5, v4 * *(_WORD *)(v6 + 14), v8, *(void **)(v9 + 8));
  if ( v3 )
    *(_DWORD *)v3 = *(_DWORD *)(v9 + 16);
  return v9;
}

How to decode the compressed pixel data remains to be worked out.
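The fields that CImage::DecodeJbig reads can be pulled out in Python along the following lines. This is a minimal sketch, assuming the standard 40-byte little-endian BITMAPINFOHEADER layout; `DibHeader` and `parse_dib_header` are hypothetical names, not part of caj2pdf or libreaderex.so, and the sample bytes are made up to describe a 2368 x 3412 1-bit page.

```python
import struct
from typing import NamedTuple


class DibHeader(NamedTuple):
    width: int
    height: int
    bit_count: int
    row_bytes: int


def parse_dib_header(buf: bytes) -> DibHeader:
    """Read width, height and bit depth from a 40-byte BITMAPINFOHEADER."""
    # Assumed layout (little-endian): biSize, biWidth, biHeight, biPlanes, biBitCount.
    size, width, height, _planes, bit_count = struct.unpack_from("<IiiHH", buf, 0)
    if size != 40:
        raise ValueError(f"unexpected DIB header size: {size}")
    # Same rounding as the decompiled code: rows are padded to 32-bit boundaries.
    row_bytes = 4 * ((width * bit_count + 31) // 32)
    return DibHeader(width, height, bit_count, row_bytes)


# Illustrative header: biSize=40, width=2368, height=3412, planes=1, bit count=1,
# with the remaining fields zero-padded.
sample = bytes.fromhex("28000000" "40090000" "540D0000" "0100" "0100").ljust(40, b"\0")
header = parse_dib_header(sample)
print(header)
```

The offsets 4, 8 and 14 used by `struct.unpack_from` here are the same ones the decompiled function dereferences for width, height and bit count.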

@lelandyang
Contributor

lelandyang commented May 30, 2020

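Have you looked into the compressed text inside HN files? Here is what I have found so far: the two bytes immediately after COMPRESSTEXT appear to be the end position of the compressed text, and since the compressed content that follows starts with 0x78da, it should be zlib at the highest compression level. I tried decompressing it and it worked.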

@JeziL
Member Author

JeziL commented May 30, 2020

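For the text part, both zlib-compressed and raw data occur. The data does contain the text: you can see Chinese characters in GB2312 encoding (little-endian) laid out at fixed intervals. The rest of the structure does look like some kind of index, possibly related to where the text sits on the page; I have not looked into it in depth yet.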

@HinTak
Contributor

HinTak commented Dec 12, 2020

The Linux AppImage (#51) contains libreaderex_x64.so; perhaps it can be used as a black box through Python ctypes.

Also, the CAJ file I have contains countless COMPRESSTEXT and JFIF (the JPEG file header) markers.

@HinTak
Contributor

HinTak commented Dec 13, 2020

You can mount the appimage with the command below:

mount -o ro,loop,offset=187784 download.cnki.net/CAJViewer-x86_64-buildubuntu1604-201021.AppImage /mnt

Then you can trace for example, usage of FileStream::seek and any calls to *CAJ* with:

ltrace -C -x *CAJ*+\*seek\*@MAIN  -x \*seek\*@libreaderex_x64.so -o trace-log /mnt/usr/bin/cajviewer

This way you can probably work out a lot of what libreaderex_x64.so does, by studying how it moves around within the file.

It seems that in my CAJ file, just before the first COMPRESSTEXT there is an array of [offsets + ?] (a 20-byte struct each); these are page property structs, and the first page's offset is 8 bytes before the first COMPRESSTEXT.

@JeziL
Member Author

JeziL commented Dec 13, 2020

That sounds like quite a promising approach.

@HinTak
Contributor

HinTak commented Dec 13, 2020

Yes, it is. ltrace shows the arguments passed to routines and methods, and the values they return, so it tells you a lot about any library you try to trace. That is especially true here, where the main application is separate from the library and the interesting functionality is all in one library with fairly clearly named calls. The Windows DLLs are accessed by ordinals (anonymous numbered routines), which is a lot less friendly and convenient for reverse engineering.

The problems I have are: (1) you need to filter out uninteresting information, such as calls involving std::*; (2) I only have one CAJ file I want to read.

@HinTak
Contributor

HinTak commented Dec 13, 2020

And the Linux shared library, like the android one, contains quite easy-to-understand class and method names, like CAJReader::Open()!

@HinTak
Contributor

HinTak commented Dec 14, 2020

I think the description of the header at the beginning is completely wrong (or does not describe the HN file I have).

After the TOC comes "20-byte page struct" * "num pages". Immediately after the page structs is the page content. The first int in the 20 bytes is the offset to the page content; the second int is the size of the text part.

Each page content is 8 bytes of I don't know what, then "COMPRESSTEXT", then data up to the size given in the page struct.

Right after that is another 8 bytes of I don't know what, then the size of the image data, then the image itself. In my 71-page document, for 28 of the pages the image at this point is just a JPEG.

So the middle of the file is: TOC (= toc_entry * num_toc), page info struct * num_page, page content 1 (compressed text, then image), page content 2, ...

What I don't know are:

  • the 20-byte page struct, first int is offset to compress text, 2nd int is size of compress text; don't know the meaning of the 3rd int; 4th and 5th are always zero in my case

  • the 8-byte text header before "COMPRESSTEXT"

  • the 8-byte header for image, before the size and the image data
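The page table described above could be read along these lines. This is only a sketch under stated assumptions: the five ints are taken to be unsigned and little-endian, and `PageEntry`, `read_page_table` and the field names are hypothetical labels for the fields as understood so far. The sample values are made up.

```python
import struct
from typing import List, NamedTuple


class PageEntry(NamedTuple):
    content_offset: int  # offset of the page content in the file
    text_size: int       # size of the compressed text part
    unknown: int         # meaning not yet understood
    reserved1: int       # zero in the sample discussed here
    reserved2: int       # zero in the sample discussed here


# Assumption: five little-endian unsigned 32-bit ints, 20 bytes in total.
PAGE_ENTRY = struct.Struct("<5I")


def read_page_table(data: bytes, table_offset: int, num_pages: int) -> List[PageEntry]:
    """Parse num_pages consecutive 20-byte page structs starting at table_offset."""
    entries = []
    for i in range(num_pages):
        fields = PAGE_ENTRY.unpack_from(data, table_offset + i * PAGE_ENTRY.size)
        entries.append(PageEntry(*fields))
    return entries


# Synthetic table with two made-up entries, placed 8 bytes into a buffer.
table = PAGE_ENTRY.pack(0x100, 0x40, 7, 0, 0) + PAGE_ENTRY.pack(0x200, 0x50, 9, 0, 0)
pages = read_page_table(b"\0" * 8 + table, 8, 2)
for page in pages:
    print(page)
```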

@HinTak
Contributor

HinTak commented Dec 14, 2020

(Removed earlier comments that were proved to be wrong.)

It is not N, X, Y; the third int,
0x00 02 E6 C8, is the JPEG file size. FF D8 FF E0 is already the JPEG header.

@HinTak
Contributor

HinTak commented Dec 15, 2020

See #53 for my HN parsing code. I added a "caj2pdf parse ..." option that just shows the content, with a partial hexdump too.

example output:

Page Text Header dump:
0000   03 80 E0 16 03 80 0F 21 43 4F 4D 50 52 45 53 53    .......!COMPRESS
0010   54 45 58 54 B8 01 00 00 78 DA 93 69 60 60 38 07    TEXT....x..i``8.

size of image data = 472750 (JPEG)
Page Image Header dump:
0000   02 00 00 00 C5 4D 00 00 AE 36 07 00 FF D8 FF E0    .....M...6......
0010   00 10 4A 46 49 46 00 01 01 00 00 01 00 01 00 00    ..JFIF..........
size of image data = 10864 (JBIG?)
Page Image Header dump:
0000   00 00 00 00 8E 88 07 00 70 2A 00 00 28 00 00 00    ........p*..(...
0010   40 09 00 00 54 0D 00 00 01 00 01 00 00 00 00 00    @...T...........

My current thinking is that the HN format is like the DjVu format: pages are full-page images plus text. If a page has any figure at all, it is stored as a JPEG for colour/gray (with the text included as graphics). Pure text pages are stored as 1-bit black-and-white images, that is, the text converted to a 1-bit bitmap.
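Going by the two "Page Image Header dump" samples above, the 12 bytes before the image data look like three little-endian ints: an image type, an unknown value, and the size of the image data. A minimal sketch of reading them follows; `parse_image_header` and `IMAGE_TYPES` are hypothetical names, and the type table only lists the two codes seen so far.

```python
import struct

# Assumption: type codes inferred from the dumps; other values are not yet known.
IMAGE_TYPES = {0: "JBIG?", 2: "JPEG"}


def parse_image_header(buf: bytes, offset: int = 0):
    """Split a page image record into (type_name, unknown, size, payload)."""
    image_type, unknown, size = struct.unpack_from("<III", buf, offset)
    start = offset + 12
    payload = buf[start:start + size]  # shorter than size if buf is truncated
    type_name = IMAGE_TYPES.get(image_type, f"type {image_type}")
    return type_name, unknown, size, payload


# First 16 bytes of the JPEG page dump shown above.
record = bytes.fromhex("02000000C54D0000AE360700FFD8FFE0")
kind, unknown, size, payload = parse_image_header(record)
print(kind, size, payload[:2].hex())
```

Run against the first dump, the size field comes out as 472750, which agrees with the "size of image data" line printed by the parse option.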

I have two questions I cannot answer yet:

  • image type 2 is definitely JPEG; I don't know what image type 0 is. It is JBIG or JBIG2, but I don't know which, or how to read it.

  • it seems that all my COMPRESSTEXT blocks are zlib-compressed; the int after COMPRESSTEXT is the expanded size, and inflating the data as a stream basically always yields that size... but I can't read the expanded content. @JeziL You wrote
    "GB2312-encoded (little-endian) characters laid out at fixed intervals" - can you update my code (with some *.decode("gb2312")) to show what you meant?
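The COMPRESSTEXT handling described in this comment can be sketched with the standard `zlib` module. This assumes the layout seen in the text header dump (the marker, a 4-byte little-endian expanded size, then a zlib stream starting with 78 DA); `extract_compressed_text` is a hypothetical helper, and the page bytes below are synthetic, not taken from a real CAJ file.

```python
import struct
import zlib

MARKER = b"COMPRESSTEXT"


def extract_compressed_text(page: bytes) -> bytes:
    """Return the inflated bytes that follow the COMPRESSTEXT marker."""
    pos = page.find(MARKER)
    if pos < 0:
        raise ValueError("COMPRESSTEXT marker not found")
    pos += len(MARKER)
    (expanded_size,) = struct.unpack_from("<I", page, pos)
    # decompressobj stops at the end of the zlib stream; whatever follows
    # (for example the image record) is left in inflater.unused_data.
    inflater = zlib.decompressobj()
    raw = inflater.decompress(page[pos + 4:])
    if len(raw) != expanded_size:
        raise ValueError(f"expected {expanded_size} bytes, got {len(raw)}")
    return raw


# Synthetic page: 8 unknown header bytes, marker, expanded size, zlib stream, trailing data.
text = b"synthetic payload " * 4
stream = zlib.compress(text, 9)  # level 9 produces the 78 DA header
page = b"\0" * 8 + MARKER + struct.pack("<I", len(text)) + stream + b"\xff\xd8"
print(extract_compressed_text(page))
```

Checking the inflated length against the stored size is a cheap way to confirm that the int after the marker really is the expanded size for a given file.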

@HinTak
Contributor

HinTak commented Dec 15, 2020

I have managed to figure out the "GB2312-encoded (little-endian) characters laid out at fixed intervals", and pushed the code out in #53. It seems that gbk/gb18030 works marginally better, but all of them leave some rubbish between the expected text.

@HinTak
Contributor

HinTak commented Dec 15, 2020

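Updated the code in #53 - gbk/gb18030 is better. I see why I am getting rubbish: it is the nature of the document I have, a thesis on music analysis. The musical-score parts are not text, so turning them into text naturally yields garbage; that is to be expected.

So my parse HN option converts to plain text very successfully, except that some graphics come out as garbage. I believe that is a problem with the original OCR, not a shortcoming of the code I wrote.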

@HinTak
Contributor

HinTak commented Dec 15, 2020

With the image-dumping code (9b2cd32), I take some of my comments back.

The images are either jpeg or DIB, according to the linux utility file (version 5.39):

image_dump_0001.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3422, components 3
image_dump_0002.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0003.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0004.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0005.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0006.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0007.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0008.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0009.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0010.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0011.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0012.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0013.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3412, components 3
image_dump_0014.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0015.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3412, components 3
image_dump_0016.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0017.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0018.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0019.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0020.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0021.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3412, components 3
image_dump_0022.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2367x3407, components 3
image_dump_0023.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2348x3375, components 3
image_dump_0024.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0025.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2344x3411, components 3
image_dump_0026.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2312x3445, components 3
image_dump_0027.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2318x3420, components 3
image_dump_0028.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0029.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2352x3382, components 3
image_dump_0030.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2343x3409, components 3
image_dump_0031.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0032.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0033.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0034.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2362x3405, components 3
image_dump_0035.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0036.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2351x3405, components 3
image_dump_0037.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0038.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0039.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0040.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3412, components 3
image_dump_0041.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2368x3412, components 3
image_dump_0042.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0043.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0044.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0045.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2311x3442, components 3
image_dump_0046.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2306x3443, components 3
image_dump_0047.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2312x3417, components 3
image_dump_0048.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0049.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0050.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0051.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0052.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0053.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0054.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2342x3411, components 3
image_dump_0055.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2339x3423, components 3
image_dump_0056.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2306x3445, components 3
image_dump_0057.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2346x3429, components 3
image_dump_0058.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2335x3416, components 3
image_dump_0059.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2305x3441, components 3
image_dump_0060.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2332x3415, components 3
image_dump_0061.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2337x3424, components 3
image_dump_0062.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2340x3408, components 3
image_dump_0063.dat: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 2309x3440, components 3
image_dump_0064.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0065.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0066.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0067.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0068.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0069.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0070.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m
image_dump_0071.dat: Device independent bitmap graphic, 2368 x 3412 x 1, image size 0, resolution 11811 x 11811 px/m

The JPEG files are plain, valid JPEG files and can be viewed with any JPEG image viewer, but they seem to be vertically flipped (upside-down). What the DIB images are is still unknown.

However, it is quite clear that they are all full-page images of very similar widths and heights.
The new parse option already extracts text quite correctly, so the only task now is to figure out what the DIB images are.

HN files seem very similar to the djvu format - with full page images (jpeg for any pages with figures, otherwise jbig for 1-bit b/w images) for each page, plus a transparent colourless text overlay layer, so that cut-and-paste text out of the document can work.

@HinTak
Contributor

HinTak commented Dec 16, 2020

There are 5 kinds of images possible; jbig and jpeg are the two most common ones. The other 3 are jpeggray, jbig2 and jpeg2000.

Pdf supports all of them except jbig. Djvu uses jbig natively (and others?) but not jbig2.

So conversion to djvu is slightly easier. Conversion to pdf would require decompressing jbig (and recompress with ccitt g4 or deflate) to be compatible with pdf.

For jbig, the header is 40 bytes of DIB then 8 bytes of palette.

@HinTak
Contributor

HinTak commented Dec 16, 2020

nm -C -D gives quite good prototype information. One can probably write a new header file and link against libreaderex_x64.so to study it.

@lelandyang
Contributor

But the JBIG encoding here is quite weird, because JBIG itself has no such header: it consists of a BIE (bi-level image entity), and the ITU recommends TIFF for encapsulating it. I wonder whether the encoding takes just part of the JBIG standard or is a modified version of JBIG/JBIG2.

@HinTak
Contributor

HinTak commented Dec 16, 2020

JBIG was quite popular in the early 2000s because of DjVu (and documents scanned to that format). PDF gained JBIG2 a bit later, for better file sizes.

The Linux shared library libreaderex_x64.so has quite a lot of GPL /BSD libraries included. That's probably against a few of their license terms...

@lelandyang
Contributor

lelandyang commented Dec 17, 2020

Agreed.
JBIG remains a popular format for encoding digital signature images, such as the one used by UnionPay. It is safe to use JBIG because all the JBIG-related patents from IBM and Mitsubishi have expired.
JPEG2000 is a complete failure due to expensive patent fees. Interestingly, among the dozens of CAJ samples I have collected, there is not a single JP2K sample.
As long as we can decode the JBIG and JBIG2 formats, the HN format is done. We can then perform OCR with a more precise engine such as ABBYY or Tesseract.
As for license infringement, the only declaration made, as the BSD clauses require, is in the CAJViewer About dialog, and it is incomplete.
But my point is that the JBIG/JBIG2 encodings seen in CAJ files are not the standardized versions. They are very likely modified versions, for the reason I stated above; you may refer to the JBIG file format proposed by the ITU or the standard document from ISO.

@HinTak
Contributor

HinTak commented Dec 17, 2020

I tried jbig-kit without success. The variant used in CAJ is most likely some kind of headless JBIG, where the width, height and other information is stored elsewhere (in the DIB header).

@lelandyang
Contributor

That is what we have to figure out.
The JPEG and JPEG-Gray formats are known to us, but the JBIG/JB2 encodings are not.
JPEG is mostly used for the cover and is rarely found in the body of a thesis, because of its unsatisfactory compression ratio.
As for JPEG2000, it does not appear in the samples I have at hand.
It seems that all JBIG-encoded images in one CAJ file share at least the first byte, and in many cases the first 2 bytes are the same. In an extreme case, two JBIG-encoded images have identical first 16 bytes. I therefore made 2 assumptions:

  • The JBIG body contains another header;
  • The first line specifies several settings shared among all JBIG images in a file, such as strip count or bit-plane count.

@HinTak
Contributor

HinTak commented Dec 17, 2020

See above - in the 71-page caj file I am interested in, 28 pages (including page 1 / cover) are jpeg, 43 with a DIB header + palette (48 bytes). The first 4 bytes afterwards are all distinct among the 43 cases. It is 0x4B 0xC6 or 0xC7, then 2 very dramatically different bytes.

The width/height is described in the DIB header, so there is no need for a BIE...

Btw, the extracted jpeg is both up-side-down and a different color. (the front cover is black text on light brown patterned background, but the background appears as blue in the jpeg). There is probably a color transform matrix somewhere.

@lelandyang
Contributor

lelandyang commented Dec 17, 2020

Sorry to disappoint you, but you happened to pick a very unusual HN file :-D . For example, the sample from issue #7 has only 1 JPEG, as the cover; the remaining images are JBIG.
"The first 4 bytes afterwards are all distinct among the 43 cases": I would recommend attaching your sample here so that I can test it. Please note that I am referring to the JBIG-encoded pages, NOT the JPEG ones. See the screenshot attached below; the 2 images compared are page 2 and page 3 extracted from the sample in issue #7.
As for "up-side-down", that is possibly true, but I don't know whether there are bytes that indicate the orientation.
(image: Untitled)

@HinTak
Contributor

HinTak commented Dec 17, 2020

Your two pages have 4B C6 after the 48-byte DIB + palette header, as I described. So they are just like my 43 pages.

The 40 bytes are as described in the very first post, #43 (comment). They are followed by FF FF FF 00 00 00 00 00, which is a two-colour palette.

JPEG images in CAJ do not have a 48-byte DIB header, as JPEG is self-describing (width, height, etc. are all part of the JPEG data).
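The 48-byte figure follows from the header size plus the palette size. The sketch below computes the pixel-data offset the way conventional Windows DIB helper code does; whether FindDIBBits in libreaderex.so uses exactly this logic is an assumption, and `find_dib_bits` is a hypothetical Python name. The sample bytes are illustrative.

```python
import struct


def find_dib_bits(buf: bytes) -> int:
    """Return the offset of the pixel data: header size plus palette size."""
    (header_size,) = struct.unpack_from("<I", buf, 0)   # biSize
    (bit_count,) = struct.unpack_from("<H", buf, 14)    # biBitCount
    (colors_used,) = struct.unpack_from("<I", buf, 32)  # biClrUsed
    if colors_used == 0 and bit_count <= 8:
        colors_used = 1 << bit_count  # indexed images carry a full palette
    return header_size + 4 * colors_used  # each RGBQUAD entry is 4 bytes


# Illustrative 1-bit header (biSize=40, 2368 x 3412, 1 plane, 1 bit per pixel),
# followed by a white and a black RGBQUAD entry.
dib = bytes.fromhex("28000000" "40090000" "540D0000" "0100" "0100").ljust(40, b"\0")
palette = bytes.fromhex("FFFFFF00" "00000000")
offset = find_dib_bits(dib + palette)
print(offset)
```

For a 1-bit image this gives 40 + 2 * 4 = 48, so the compressed data handed to JBigCodec::Decode would start right after the two palette entries.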

@lelandyang
Contributor

Buddy, the DIB header is specific to JBIG.
JBIG in CAJ is a customized format; the name "jbig" may refer only to the pixel encoding. The DIB header is part of the JBIG image. Mr. JeziL describes it as the JBIG header; he has already made that clear in this issue.
I mean that the body parts share several bytes.

@HinTak
Contributor

HinTak commented Dec 17, 2020

Your two images share two bytes: 4B C6. Of my 43, some have 4B C6 and some have 4B C7 (and I have 28 JPEG images, which are completely different, making 43+28=71).

I have traced the execution to a routine named JBigCodec::Decode, which takes the pointer from 4B onwards as input, plus the width and height from the DIB header. That is why I am sure that it is some header-less JBIG variant.

@lelandyang
Contributor

Yes, but it makes it easier than objdump to locate the procedures and calls.
JBIG is just the name they gave it; it is very likely not the standard JBIG encoding.
The real decoding happens in LowestDecodeLine, in which the arithmetic decoding is done.

@HinTak HinTak assigned JeziL and lelandyang and unassigned JeziL and lelandyang Dec 26, 2020
@HinTak
Contributor

HinTak commented Dec 29, 2020

Actually img2pdf does not use any native/compiled code. It does however, use PIL for two purposes: it uses PIL to convert 1-bit images to ccitt g4, and also uses it to convert misc unknown image types to types which can be embedded in pdf's.

So I made a private copy of it and renamed it cajimg2pdf (it needs a better name; I was only going to take the PDF-writing part, but haven't removed the unused parts yet), and have already got rid of the first usage of PIL. The rest is two changes: the default dpi (it was 96, like BMP, but is 300 according to my CAJ sample, which seems to be US Letter size scanned at 300 dpi), and including the image flip. It turns out it already does the matrix transform I wrote about, to centre the image on the page. My PostScript was rusty: it is "1 0 0 1 xoff yoff cm" instead of "/Matrix [1 0 0 1 xoff yoff]", and "1 0 0 -1 xoff -yoff cm" includes the image flip. So I only need to insert two "-" to get it to flip by default (it is not [yet] settable as an option).
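To see why a negative vertical scale flips the image, it helps to apply a PDF "cm" matrix [a b c d e f] to the corners of the unit square that an image is painted into. The sketch below is a generic illustration of that arithmetic, not the code in cajimg2pdf: it folds the scale and the translation into a single matrix, so its operands differ from the ones quoted above, and `image_matrix`, `draw_image_ops` and the XObject name `Im0` are hypothetical.

```python
def apply_cm(matrix, x, y):
    """Apply a PDF transformation matrix [a b c d e f] to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def image_matrix(width, height, xoff, yoff, flip=False):
    """Matrix mapping the unit image square onto a width x height box at (xoff, yoff)."""
    if flip:
        # Negative vertical scale, shifted up by the height so the box stays in place.
        return (width, 0, 0, -height, xoff, yoff + height)
    return (width, 0, 0, height, xoff, yoff)


def draw_image_ops(matrix, name="Im0"):
    """Content-stream operators that paint the image XObject under the matrix."""
    operands = " ".join(f"{v:g}" for v in matrix)
    return f"q {operands} cm /{name} Do Q"


m = image_matrix(200, 100, 10, 20, flip=True)
print(draw_image_ops(m))
print(apply_cm(m, 0, 0), apply_cm(m, 0, 1))
```

With the flip enabled, the bottom-left corner of the image lands at the top of the box and the top-left corner at the bottom, while the box itself still occupies the same area of the page.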

PIL is too large a dependency to add, so I'd like to remove it; other than that, img2pdf saves me a few hundred lines of PDF-writing Python code. (It is under the LGPL license, which means you need to include the source code if you change it. Pure Python code always ships with its source anyway, so to use it in modified form we just need to add a block of comments at the top about any major changes we have made or will make.)

@HinTak
Contributor

HinTak commented Dec 29, 2020

I took a look at the older caj to pdf code to see if I could reuse them for HN to pdf - it looks like it is mainly a lot of pdf internal bits? Anyway, I also wrote a small "magic" file so you can use the GNU command file to identify which kind of caj files a sample is, (see https://github.com/caj2pdf/CAJSamples) and I call the ones with "CAJ" as header the "canonical variant".

@lelandyang
Contributor

Actually, you don't have to look at the old CAJ code, because those CAJ files are not scanned; they are converted from the PDF documents that graduate students upload when they pass their defense and obtain their diploma.

@HinTak
Contributor

HinTak commented Dec 30, 2020

I only looked at one, and converted it to PDF, to see how the conversion code works and whether I can re-use some of it. As you know, the normal CAJ files seem to be PDF contents re-arranged, and are not similar to the HN files at all. The HN files are mostly collections of full-page images, hence adapting img2pdf seems to be the quickest route (if img2pdf's large dependency on other software can be removed).

Mostly I am taking the same approach as on libreaderex_x64.so - make it work on linux first, then try to remove/change/rewrite the part which does not work on other platforms.

@lelandyang
Contributor

The only problem here is that JBIG/JBIG2 decoding is still unsolved; until it is, we cannot port to another platform.
Unluckily, I have been quite sick lately and do not have the time and energy to reverse the DLLs.

@HinTak
Contributor

HinTak commented Dec 30, 2020

I have pushed my stripped down, enhanced and renamed img2pdf out, plus the convert_hn method. So actually for my caj file, it works perfectly for caj2pdf convert now.

  • So yes: it is still Linux-specific and depends on libreaderex_x64.so being around.

  • my CAJ has no secondary figures (just a full-page image on every page). This is quite rare: every HN file I have collected under CAJSamples has at least 7-8 smaller figures. The simplest one is issue 29. I checked that the figure position offset is not next to the image, and that all of the data area is used. That leaves one possibility: COMPRESSTEXT is not just text. It seems to hold positioning and size codes for TEXT. It would make sense to have areas defined against a piece of text so that cut-and-paste out of the GUI viewer works. I haven't checked yet whether text cut-and-paste is possible under Linux, but this is a tricky area on Linux in general, as it is a form of interaction/co-operation/conversion between two applications and there is no equivalent of Windows DCOM / OLE on Linux.

  • anyway, the positioning code for additional figures seems to be at the end of the COMPRESSTEXT section!

  • I checked: besides (1) figures being off-centre left-to-right (one of the examples in CAJSamples certainly is) and (2) figures overlaying a low-res drawn area, not a blank, in the full-page background image, there is a 3rd problem: some pages have large blank areas but only a small additional image. An example is a university logo on the 2nd or 3rd page of one CAJ file, where the page is still mostly empty. So guessing where to put a figure definitely does not work; we need to find and understand the positioning code at the end of COMPRESSTEXT.

Since we got GetBit, the 4 methods I don't have are:
Decode1, the single argument method of Decode, GetCX, LowestDecodeLine.

I think I started on LowestDecodeLine but haven't reached a state I am happy with.

It is the holiday so have plenty of rests...

@lelandyang
Contributor

Decode1, the single argument method of Decode, GetCX, LowestDecodeLine. You just listed 3 functions to be re-implemented, but what is the 4th one?
As for the method I proposed: the doctoral thesis sample from the University of Science and Technology of China does have 2 pages that contain the university logo, but this is a rare case. My idea is to first make the document understandable, with illustrations and background readable, because most universities provide a Word thesis template in a simple format.
GetCX gets the context within the decoder. I will post the decompiled source when I feel better.
Thank you very much.

@HinTak
Contributor

HinTak commented Dec 30, 2020

the single argument method of Decode(). There are two Decode() methods, another with 6 arguments which is very easy to write.

The current code now put such secondary images on separate pages after the main one.

One of the CAJ files in CAJSamples, the one about painting and drawing styles, has some off-centre decoration images on the side of the page.

@lelandyang
Contributor

As I mentioned in my previous comments, the function GetCX is not too complicated. The code below is for your reference.

__int64 __fastcall JBigCodec::GetCX(JBigCodec *this, int a2, int a3)
{
  int v3; // ST00_4
  int v4; // ST1C_4
  int v5; // ST1C_4
  int v6; // ST1C_4
  int v7; // ST1C_4

  v3 = a3;
  v4 = 2 * (unsigned __int64)JBigCodec::GetBit(this, a2 - 1, a3 + 2);
  v5 = 2 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 1, v3 + 1) + v4);
  v6 = 8 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 1, v3) + v5);
  v7 = 2 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 2, v3 + 1) + v6);
  return 2 * ((unsigned int)JBigCodec::GetBit(this, a2 - 2, v3) + v7);
}
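A literal Python transliteration of the decompiled function makes it easy to see which output bit each GetBit call ends up in. This is a sketch only: `get_bit` stands in for JBigCodec::GetBit, the meaning of its two arguments (presumably a row and a column) is an assumption, and `bit_weights` is a hypothetical probing helper.

```python
def get_cx(get_bit, a2, a3):
    """Literal transliteration of the decompiled JBigCodec::GetCX."""
    cx = 2 * get_bit(a2 - 1, a3 + 2)
    cx = 2 * (get_bit(a2 - 1, a3 + 1) + cx)
    cx = 8 * (get_bit(a2 - 1, a3) + cx)
    cx = 2 * (get_bit(a2 - 2, a3 + 1) + cx)
    return 2 * (get_bit(a2 - 2, a3) + cx)


def bit_weights():
    """Set one neighbour at a time and record the value GetCX returns for it."""
    weights = {}
    for target in [(-1, 2), (-1, 1), (-1, 0), (-2, 1), (-2, 0)]:
        probe = lambda r, c, t=target: int((r, c) == t)
        weights[target] = get_cx(probe, 0, 0)
    return weights


for neighbour, weight in bit_weights().items():
    print(neighbour, bin(weight))
```

Evaluated this way, the five neighbours occupy bits 7, 6, 5, 2 and 1 of the result. The factor of 8 is what leaves bits 3 and 4 empty, and bit 0 is empty as well; one possible reading is that the caller fills those in, but that is a guess.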

@HinTak
Contributor

HinTak commented Dec 31, 2020

Thanks. The v6 = 8... looks strange. You sure it is not 2 instead?

@HinTak
Contributor

HinTak commented Dec 31, 2020

Decode(int) and Decode1(int) look almost identical. I should check what the differences are; I don't know yet. So it is almost done.

RenormDe is largely this:

while (??? <= 0x7FFF)
{
    if (!this->excess_bits)
    {
        this->ByteIn();
    }
    this->decode_state *=2 +1;
    this->excess_bits--;
}
if (!this->excess_bits)
    this->ByteIn();
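
The loop outlined above can be written out as a runnable Python sketch. It rests on several assumptions that the outline does not confirm: that the `???` is the interval register of the arithmetic decoder, that each step shifts both the interval register and the code register left by one bit, and that a freshly read byte is merged into the code register shifted up by 8 bits. The names `a`, `c`, `ct` and `byte_in` are illustrative and are not taken from the decompiled code.

```python
class RenormSketch:
    """Toy decoder state for illustrating renormalization only."""

    def __init__(self, data):
        self.data = data    # compressed input bytes
        self.pos = 0        # index of the next byte to read
        self.a = 0x10000    # interval register (assumed to be the `???`)
        self.c = 0          # code register (assumed role of decode_state)
        self.ct = 0         # unread bits left (assumed role of excess_bits)

    def byte_in(self):
        """Merge the next input byte into c; feed zeros past the end."""
        byte = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        self.c += byte << 8
        self.ct = 8

    def renorm(self):
        """Shift a and c left until a is at least 0x8000."""
        while self.a <= 0x7FFF:
            if self.ct == 0:
                self.byte_in()
            self.a <<= 1
            self.c <<= 1
            self.ct -= 1
        # Same trailing refill as in the outline above.
        if self.ct == 0:
            self.byte_in()
```

Starting from `a = 0x2000`, the loop runs twice and leaves `a` at `0x8000`, having consumed two of the eight bits of the byte it just read.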

@HinTak
Contributor

HinTak commented Jan 1, 2021

The last few routines definitely look like JBIG - there are 4 large lookup tables/arrays from the JBIG specification.

@lelandyang
Contributor

Decode(int) and Decode1(int) look almost identical - I should check what the differences are... I don't know what they are yet. So it is almost done.

RenormDe is largely this:

while (??? <= 0x7FFF)
{
    if (!this->excess_bits)
    {
        this->ByteIn();
    }
    this->decode_state *=2 +1;
    this->excess_bits--;
}
if (!this->excess_bits)
    this->ByteIn();

Yes, you are right: the two functions resemble each other, but there are still minor differences. The main difference lies in an else clause. I would say Decode1() is clearer and, at least in theory, performs better given its control flow.
As I said before, the decoding happens in LowestDecodeLine(), which calls Decode1(), where tables such as NLPS and NMPS are used.
I have attached the two decode functions; they may save you some work.
Desktop.zip

@HinTak
Contributor

HinTak commented Jan 3, 2021

Thanks. Only those two routines remain now, so this comes at a good time.

I went through the assembly listing and saw that about 7-8 instructions out of 170+ differ or are moved to a different place, so they are very similar. I also found that they probably contain heavily inlined versions of LpsExchange and MpsExchange, which are themselves unused. This is quite similar to the situation with the Android version of GetCX: in the x86_64 version, GetCX is basically 4 or 5 calls to GetBit, while in the Android version GetBit is inlined into GetCX.

I am reading ITU-81(?), the JBIG spec, which contains the NLPS, NMPS, LSZ and SWITCH tables and a description of LpsExchange/MpsExchange.

@lelandyang
Contributor

Yes, I think it's better to take a look back at ITU.82 (as I remember it).
I suppose it is a tricky implementation.

@HinTak
Contributor

HinTak commented Jan 7, 2021

I finished writing and debugging a new JBigCodec (only the decoding part). It generates byte-wise identical output to libreaderex_x64.so, and "caj2pdf convert ..." can convert most HN files, so I think this can almost be closed, except for these issues:

  • smaller images on a page are put onto separate smaller pages instead of being overlaid
  • you don't need libreaderex_x64.so any more, but you still need to compile the small python ctypes module. I cross-compiled it and committed it as libjbigdec-w32/w64.dll. Windows users don't need a compiler but need a small adjustment at the top of jbigdec.py to pick the right one. Sorry, I don't use windows and certainly don't use python on windows, so one of you can add and test such an adjustment...
  • adding a modified version of img2pdf probably adds a hidden dependency on PIL. I'd like to remove the dependency on PIL totally; partly for windows but also partly because it is a large dependency.
  • some very old HN files (only one sample exists - further up in this issue) need DecodeJBig2 - this seems to be a very small wrapper around libpoppler/xpdf's JBIG2Stream code - it is under GPL, so the CAJViewer authors have definitely violated the license terms by not releasing libreaderex_x64.so in source code form... Anyway, I can probably recreate the wrapper in a few hours, since it is very simple and does not require any reverse-engineering.
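
The Windows adjustment mentioned in the list above could look roughly like the following Python sketch. The DLL names libjbigdec-w32.dll and libjbigdec-w64.dll come from this comment; the function names, the non-Windows filename libjbigdec.so, and the directory argument are assumptions for illustration and are not what jbigdec.py actually contains.

```python
import ctypes
import os
import platform
import struct


def pick_jbigdec_library(system=None, pointer_bits=None):
    """Return the library filename to load for the given platform.

    Both arguments default to the running interpreter; they are
    parameters only so the choice can be checked without Windows.
    """
    if system is None:
        system = platform.system()
    if pointer_bits is None:
        # Size of a C pointer in this Python build: 32 or 64.
        pointer_bits = struct.calcsize("P") * 8
    if system == "Windows":
        # DLL names as committed in the repository, per the comment above.
        if pointer_bits == 64:
            return "libjbigdec-w64.dll"
        return "libjbigdec-w32.dll"
    # Assumed name for the library built locally on Linux/macOS.
    return "libjbigdec.so"


def load_jbigdec(directory="."):
    """Load the chosen library with ctypes (raises OSError if missing)."""
    return ctypes.CDLL(os.path.join(directory, pick_jbigdec_library()))
```

Keying the choice on the pointer size of the Python build rather than on the operating system's bitness matters because a 32-bit Python on 64-bit Windows can only load the 32-bit DLL.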

I think we can close this issue as it is getting rather long, and just continue with filing the items above as new issues.

After having written a byte-wise compatible version of JBigCodec, I would say that the algorithm is exactly like ITU.82 wherever there is a flow diagram in the ITU document!!!! But the details without a flow diagram are just done wrongly!!!! - the context (the CX / GetBit routines) and the init LNTP templates (LowestDecode) are just strange, and hugely wasteful...

@HinTak
Contributor

HinTak commented Jan 7, 2021

Finished and pushed the jbig2 decode routine out. So "caj2pdf convert ..." works on all HN variants, subject to: (1) you need to build two small shared libraries, single-file source provided (no more dependence on libreaderex_x64.so); (2) smaller images on a page are put on their own separate pages.

The jbig2 data is really standard, so just 30 lines of C++ code are needed to wrap around libpoppler to make it imitate libreaderex_x64.so's routines. I suppose if you want a smaller dependency you can use jbig2dec the library, but it is less common (mupdf/mutool/ghostscript use it bundled) than libpoppler, and perhaps more tedious to use.

I think the branch can be merged and this closed, since we now can convert all the submitted HN files to pdf without using libreaderex_x64.so .

@JeziL , @lelandyang let me know if you feel strongly about not merging, etc.

@HinTak
Contributor

HinTak commented Jan 7, 2021

I have also updated the readme in the dev branch, so when it is merged the readme will be updated with some HN info.

@HinTak
Contributor

HinTak commented Jan 7, 2021

Also added libjbig2dec-based jbig2 decoding, besides the libpoppler one, and updated the README and docs elsewhere.

@HinTak
Contributor

HinTak commented Jan 8, 2021

Here is the converted result of that strange old HN file:

Windows9x_NT操作系统的磁盘备份与恢复的研究与实现_张宗伟.pdf

Remaining issues are tracked in #57 .

I just checked that Fedora (which I use) ships a mingw cross-compiled libpoppler, so I can actually build the JBIG2 modules as a dll for windows too. I still don't use python on windows, but I can commit the dlls into the repo and one of you can edit the python code slightly to look for the windows dll under windows instead of the shared library for linux.

@HinTak
Contributor

HinTak commented Jan 9, 2021

I did the libpoppler-based jbig2 dlls, but I don't like the result - it brings in gcc's C++ runtime dll (the equivalent of msvcr1xx / msvcp1xx from Microsoft) and a dozen libraries totalling about 60MB; so I switched to libjbig2dec for building the jbig2 dll.

libpoppler is more common on Linux (every pdf reader on linux depends on it to read pdfs, except Adobe's and ghostscript/mupdf), while only ghostscript and mupdf use libjbig2dec, which just provides jbig2 decoding.

Tested against windows python 3.5.x on wine and updated the windows dll loading too - so it works on windows without needing to build shared libraries / dlls.

Last chance!!! - if I don't hear from you guys @JeziL @lelandyang , I am merging HN-dev onto master in a couple of days. " ... convert ..." works on windows as is (I built and bundled the dlls), while on linux/mac instructions are provided on how to build those two shared libraries (and it is quite obvious when they are missing and fail to load).

Follow-up issues are at #57

@HinTak HinTak closed this as completed Jan 9, 2021