Analysis of HN-format image data #43
Has anyone looked into the compressed text inside HN files? Here is what I have found so far: the two bytes immediately following COMPRESSTEXT seem to mark the end position of the compressed text, and judging from the 0x78da header of the payload that follows, it is zlib at the highest compression level; I decompressed it successfully.
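The observation above is easy to check with Python's zlib module. A minimal sketch, assuming the payload layout described in this comment (marker, two size bytes, then the 0x78da zlib stream); `decompress_text` is a name made up for illustration:

```python
import zlib

def decompress_text(blob: bytes) -> bytes:
    """Decompress a COMPRESSTEXT payload.

    `blob` is assumed to start at the 'COMPRESSTEXT' marker; the exact
    meaning of the two bytes after the marker (end position / size of
    the compressed text) is still a guess from this thread.
    """
    marker = b"COMPRESSTEXT"
    assert blob.startswith(marker)
    body = blob[len(marker) + 2:]        # skip marker + 2 size bytes
    # 0x78 0xda is the zlib header for deflate at the highest level
    assert body[:2] == b"\x78\xda"
    return zlib.decompress(body)

# Round-trip sanity check with a synthetic payload
raw = "你好世界".encode("gb2312")
payload = b"COMPRESSTEXT" + b"\x00\x00" + zlib.compress(raw, 9)
assert decompress_text(payload) == raw
```

Real HN files may put the size bytes to different use; the only part verified here is that a level-9 zlib stream round-trips.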
For the text part, both zlib-compressed and raw data occur. The data does contain the text itself: you can see GB2312-encoded (little-endian) Chinese characters laid out at fixed intervals. The rest of the structure does look like some kind of index, probably related to where the characters sit on the page; I have not dug into the details yet.
The Linux AppImage build ( #51 ) contains; also, the caj files I have contain countless
You can mount the appimage with the command below:
Then you can trace, for example, the usage of
This way you can probably come to understand a lot of it. It seems that in my CAJ, just before the first
That sounds like quite a promising approach.
Yes, it is. ltrace gives you the arguments passed to routines/methods and the returned values, so it tells you a lot about any library you try to trace, especially in this situation, where the main application is separate from the library, the interesting functionality is all in one library, and the calls are fairly clearly named. The Windows DLL's routines are accessed by ordinals, i.e. anonymous numbered routines, which is a lot less friendly or convenient for reverse-engineering. The problems I have / had are (1) you need to filter away uninteresting information, like calls involving
And the Linux shared library, like the Android one, contains quite easy-to-understand class and method names, like
I think the description of the header at the top is completely wrong (or it does not describe the HN file I have). After the toc come "num pages" 20-byte page structs. Immediately after the page structs is the page content. The first int in each 20-byte struct is the offset to that page's content; the second int is the size of the text part. Each page content is 8 bytes I can't identify, then "COMPRESSTEXT", then data up to the size given in the page struct. Right after that is another 8 bytes I can't identify, then the size of the image data, then the image itself. In my 71-page document, for 28 of the pages the data at this point is just a JPEG image. So the middle of the file is: toc (= toc_entry * num_toc), page info struct * num_pages, page content 1 (compressed text then image), page content 2, and so on. What I don't know yet:
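The 20-byte page struct described here can be probed with a short Python sketch. The field names are hypothetical, based only on this comment's findings (int 1 = content offset, int 2 = text size, ints 3-5 unknown):

```python
import struct

# 20 bytes: five little-endian ints (layout guessed from this thread)
PAGE_STRUCT = struct.Struct("<5i")

def parse_page_struct(buf: bytes, pos: int = 0) -> dict:
    """Decode one per-page record: offset of the page content, size of
    the text part, and three ints whose meaning is still unknown."""
    offset, text_size, u1, u2, u3 = PAGE_STRUCT.unpack_from(buf, pos)
    return {"offset": offset, "text_size": text_size,
            "unknown": (u1, u2, u3)}

# Synthetic record: content at 0x1234 with 256 bytes of text
rec = struct.pack("<5i", 0x1234, 256, 0, 0, 0)
assert parse_page_struct(rec) == {"offset": 0x1234, "text_size": 256,
                                  "unknown": (0, 0, 0)}
```

Dumping the three unknown ints across all pages of a real file is probably the quickest way to spot a pattern in them.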
They are not N, X, Y; the third int is
See #53 for my HN parsing code. I added a "caj2pdf parse ..." option to just show the content, with a partial hexdump too. Example output:
My current thinking is that the HN format is like the djvu format: pages are full-page images plus text. If a page has any image at all, it becomes a JPEG for colour/grey (including the text as graphics). Pure text pages are stored as 1-bit b/w images, i.e. the text converted to a 1-bit bitmap. I have two questions I don't know the answer to yet:
I have managed to figure out the "GB2312-encoded characters laid out at fixed intervals (little-endian)" part, and pushed the code out in #53; it seems that gbk/gb18030 works marginally better, but all of them leave some rubbish between the expected text.
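A sketch of that extraction, with the stride between characters left as a parameter since the real interval (and any per-character index fields) is still being worked out in this thread; the synthetic layout below is an assumption for demonstration only:

```python
def extract_text(blob: bytes, stride: int) -> str:
    """Pick up 2-byte GB-coded characters laid out every `stride`
    bytes.  gb18030 is a superset of gb2312/gbk, hence the safest
    decoder to try; rubbish between real characters shows up as
    replacement characters."""
    chars = []
    for pos in range(0, len(blob) - 1, stride):
        chars.append(blob[pos:pos + 2].decode("gb18030", errors="replace"))
    return "".join(chars)

# Synthetic layout: each character followed by 2 bytes of "index" data
blob = b"".join(ch.encode("gb18030") + b"\x00\x00" for ch in "测试")
assert extract_text(blob, 4) == "测试"
```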
Updated the code in #53; gbk/gb18030 is better. I also see why I am getting rubbish: it is the nature of the document I have. It is a thesis on music analysis, and the musical-score parts are not text, so converting them to text naturally yields garbage; that is expected. So my parse-HN option converts to plain text very successfully, apart from some graphics coming out as garbage. I believe that is down to the original OCR, not a shortcoming of my code.
With the image-dump code: the images are either JPEG or DIB, according to the Linux utility
The JPEG files are just plain valid JPEG files, though they seem to be vertically flipped (upside-down); they can be viewed with any JPEG viewer. The DIB images are still unknown. However, it is quite clear that they are all full-page images of very similar widths and heights. HN files seem very similar to the djvu format: full-page images for each page (JPEG for pages with figures, otherwise JBIG for 1-bit b/w images), plus a transparent, colourless text overlay layer, so that cutting-and-pasting text out of the document works.
There are 5 kinds of images possible; JBIG and JPEG are the two most common. The other 3 are jpeggray, JBIG2 and JPEG2000. PDF supports all of them except JBIG. Djvu uses JBIG natively (and others?) but not JBIG2, so conversion to djvu is slightly easier. Conversion to PDF would require decompressing JBIG (and recompressing with CCITT G4 or deflate) to be compatible with PDF. For JBIG, the header is 40 bytes of DIB then 8 bytes of palette.
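Following the findings so far (JPEG is self-describing, the JBIG-like data sits behind a 40-byte DIB header whose first field reads 40), a rough classifier for a page-image blob could look like this sketch:

```python
def classify_image(blob: bytes) -> str:
    """Rough classifier for a page-image blob, per this thread:
    JPEG data starts with the usual SOI marker FF D8, while the
    JBIG-like payload is preceded by a 40-byte DIB header (whose
    first little-endian int, biSize, equals 40) plus an 8-byte
    two-colour palette."""
    if blob[:2] == b"\xff\xd8":
        return "jpeg"
    if int.from_bytes(blob[:4], "little") == 40:
        return "dib"   # 40-byte DIB header -> JBIG-like payload
    return "unknown"

assert classify_image(b"\xff\xd8\xff\xe0") == "jpeg"
assert classify_image((40).to_bytes(4, "little") + b"\x00" * 44) == "dib"
```

This only sniffs the two common kinds; the jpeggray/JBIG2/JPEG2000 cases would need their own signatures.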
But the JBIG encoding here is quite weird: it does not have the usual header. A standard JBIG stream consists of a BIE (bi-level image entity), and the ITU recommends using TIFF to encapsulate the encoding. I wonder whether this encoding takes only part of the JBIG standard, or is a modified version of JBIG/JBIG2.
JBIG was quite popular in the early 2000s because of djvu (and documents scanned into that format). PDF added JBIG2 a bit later, for better file sizes. The Linux shared library
Agreed.
I tried jbig-kit without success. The variant used in caj is most likely some kind of headless JBIG where the width/height etc. are stored elsewhere (in the DIB header).
That is what we have to figure out.
See above: in the 71-page caj file I am interested in, 28 pages (including page 1, the cover) are JPEG, and 43 have a DIB header + palette (48 bytes). The first 4 bytes afterwards are distinct across all 43 cases: 0x4B, then 0xC6 or 0xC7, then 2 bytes that vary dramatically. The width/height are described in the DIB header, so there is no need for a BIE... Btw, the extracted JPEG is both upside-down and wrong in colour (the front cover is black text on a light-brown patterned background, but the background appears blue in the JPEG). There is probably a colour-transform matrix somewhere.
Sorry to disappoint you, but you happened to pick a very particular HN file :-D . For example, the sample from issue #7 has only one JPEG, the cover; the remaining pages are JBIG.
Your two pages have 4B C6 after the 48-byte DIB + palette header, as I described, so they are just like my 43 pages. The 40 bytes are as described in the very first post #43 (comment). They are followed by FF FF FF 00 00 00 00 00 00, which is a two-colour palette. JPEG images in caj do not have the 48-byte DIB, as JPEG is self-describing (width/height etc. are all part of the JPEG data).
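Assuming the 40-byte header really is a standard Windows BITMAPINFOHEADER (this thread's claim that width/height live in it fits that layout, but it is still an assumption), the 48-byte prefix can be parsed like so; the trailing two mystery bytes (0x4B then 0xC6/0xC7) are just carried along:

```python
import struct

def parse_dib_header(blob: bytes) -> dict:
    """Parse the header before a JBIG payload, assuming a standard
    BITMAPINFOHEADER (biSize must read 40), followed by an 8-byte
    two-entry palette and the two unexplained tag bytes."""
    (size, width, height, planes, bpp,
     compression, image_size) = struct.unpack_from("<iiiHHIi", blob, 0)
    assert size == 40, "not a 40-byte DIB header"
    return {"width": width, "height": height, "bpp": bpp,
            "palette": blob[40:48], "tag": blob[48:50]}

# Synthetic 1-bit A4-at-300dpi-sized header + white/black palette + tag
hdr = struct.pack("<iiiHHIi", 40, 2480, 3508, 1, 1, 0, 0) + b"\x00" * 16
hdr += b"\xff\xff\xff\x00" + b"\x00" * 4 + b"\x4b\xc6"
info = parse_dib_header(hdr)
assert info["width"] == 2480 and info["bpp"] == 1
assert info["tag"] == b"\x4b\xc6"
```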
Buddy, DIB is specific to JBIG.
Your two images share two bytes: 4B C6. Out of my 43, some have 4B C6 and some 4B C7 (and I have 28 JPEG images, which are completely different, making 43+28=71). I have traced the execution to a routine named
Yes, but it makes it easier than objdump to locate the procedures and calls.
Actually img2pdf does not use any native/compiled code. It does, however, use PIL for two purposes: to convert 1-bit images to CCITT G4, and to convert miscellaneous unknown image types into types which can be embedded in PDFs. So I made a private copy of it and renamed it. PIL is too large a dependency to add, so I'd like to remove it; but other than that, img2pdf saves me a few hundred lines of PDF-writing Python code. (It is under an LGPL license, which means you need to include source code if you change it; but pure Python code always ships with its source anyway, so we just need to add a block of comments at the top about any major changes made, or to be made, in order to use it in modified form.)
I took a look at the older caj-to-pdf code to see if I could reuse it for HN-to-pdf. It looks like it is mainly a lot of PDF internal bits? Anyway, I also wrote a small "magic" file so you can use the GNU command
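For reference, such a magic(5) fragment might look like the one-liner below. This is only a sketch under the assumption that HN-variant files literally begin with the ASCII bytes "HN" at offset 0 (the way caj2pdf sniffs the format); adjust the offset/bytes if your samples differ:

```
0	string	HN	CAJ document (HN image variant)
```

Dropped into a file passed via `file -m`, it makes `file` label HN samples instead of reporting generic "data".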
Actually, you don't have to look at the old caj code, because those caj files are not scanned; they were converted from the PDF documents uploaded by graduate students when they passed their defense and obtained their diploma.
I only looked at one, and converted one to PDF, to see how the conversion code works and whether I can reuse some of it. As you know, the normal caj files seem to be PDF content re-arranged, and are not similar to the HN files at all. The HN files are mostly collections of full-page images, hence adapting img2pdf seems the quickest route (if img2pdf's large dependency on other software can be removed). Mostly I am taking the same approach as with libreaderex_x64.so: make it work on Linux first, then try to remove/change/rewrite the parts that do not work on other platforms.
The sole problem here is that JBIG/JBIG2 decoding is still a big obstacle: as long as it depends on the native library, we cannot port to other platforms.
I have pushed out my stripped-down, enhanced and renamed img2pdf, plus the convert_hn method. So for my caj file, it actually works perfectly for
Since we have GetBit, the 4 methods I don't have are: I think I started on LowestDecodeLine but haven't reached a state I am happy with. It is the holidays, so plenty of rest...
Decode1, the single-argument Decode method, GetCX, LowestDecodeLine. You just listed 3 functions to be re-implemented, but what is the 4th one?
The single-argument Decode() method. There are two Decode() methods; the other one, with 6 arguments, is very easy to write. The current code puts such secondary images on separate pages after the main one. One of the CAJ files in CAJSamples, the one about paintings and drawing styles, has some off-centre decoration images at the side of the page.
As mentioned in my previous comments, the function decompiles to:
__int64 __fastcall JBigCodec::GetCX(JBigCodec *this, int a2, int a3)
{
int v3; // ST00_4
int v4; // ST1C_4
int v5; // ST1C_4
int v6; // ST1C_4
int v7; // ST1C_4
v3 = a3;
v4 = 2 * (unsigned __int64)JBigCodec::GetBit(this, a2 - 1, a3 + 2);
v5 = 2 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 1, v3 + 1) + v4);
v6 = 8 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 1, v3) + v5);
v7 = 2 * ((unsigned __int64)JBigCodec::GetBit(this, a2 - 2, v3 + 1) + v6);
return 2 * ((unsigned int)JBigCodec::GetBit(this, a2 - 2, v3) + v7);
}
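Unrolled, the decompiled arithmetic packs five neighbour pixels from the two rows above the current one (row-1: col..col+2, row-2: col..col+1) into bits 7..5 and 2..1 of the context, leaving bits 0, 3 and 4 clear (presumably filled in elsewhere, e.g. from the current row). A Python transcription, where `get_bit` is any callable standing in for JBigCodec::GetBit:

```python
def get_cx(get_bit, row, col):
    """Transcription of the decompiled JBigCodec::GetCX."""
    b1 = get_bit(row - 1, col + 2)   # -> bit 7
    b2 = get_bit(row - 1, col + 1)   # -> bit 6
    b3 = get_bit(row - 1, col)       # -> bit 5
    b4 = get_bit(row - 2, col + 1)   # -> bit 2
    b5 = get_bit(row - 2, col)       # -> bit 1
    # Same nesting of *2 and *8 as the decompiled code
    return 2 * (b5 + 2 * (b4 + 8 * (b3 + 2 * (b2 + 2 * b1))))

# All five template pixels set -> bits 7,6,5,2,1 -> 0b11100110
assert get_cx(lambda r, c: 1, 5, 5) == 0b11100110
# Only the (row-1, col+2) pixel set -> bit 7 alone
assert get_cx(lambda r, c: 1 if (r, c) == (4, 7) else 0, 5, 5) == 128
```

The gap at bits 0/3/4 is what makes this context layout look so odd compared to the templates in the JBIG spec.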
Thanks. The
Decode(int) and Decode1(int) look almost identical. I should check what the difference is... I don't know what they do yet, so it is almost done. RenormDe is largely this:
The last few routines definitely look like JBIG: there are 4 large look-up tables/arrays from the JBIG specification.
Yes, you are right, the two functions resemble each other, but there are still minor diffs. The main diff lies in an else clause. I should say the
Thanks. Only those two routines remain now, so this comes at a good time. I went through the assembly listing and saw that only about 7-8 instructions out of 170+ differ or are moved to a different place, so the two are very similar. Also, I found that they probably contain heavily inlined versions of
I am reading ITU-81(?), the JBIG spec, which contains the NLPS, NMPS, LSZ and SWITCH tables and a description of LpsExchange/MpsExchange.
Yes, I think it's better to take another look at ITU.82 (as I remember it).
I finished writing and debugging a new JBigCodec (the decoding part only). It generates byte-wise identical output to libreaderex_x64.so, and "caj2pdf convert ..." can convert most HN files, so I think this can almost be closed, except for these issues:
I think we can close this issue as it is getting rather long, and just continue by filing the items above as new issues. Having written a byte-wise compatible version of JBigCodec, I would say that the algorithm is exactly like ITU.82 wherever there is a flow diagram in the ITU document! But the details without a flow diagram are just done wrongly: the context (the CX / GetBit routines) and the initial LNTP templates (LowestDecode) are just strange, and hugely wasteful...
Finished and pushed out the JBIG2 decode routine. So "caj2pdf convert ..." works on all HN variants, subject to: (1) you need to build two small shared libraries, single-file sources provided (no more dependence on libreaderex_x64.so); (2) smaller per-page images end up on their own separate pages. The JBIG2 data is really standard, so just 30 lines of C++ wrapped around libpoppler are needed to imitate libreaderex_x64.so's routines. I suppose if you want a smaller dependency you can use the jbig2dec library, but it is less common than libpoppler (mupdf/mutool/ghostscript bundle it) and perhaps more tedious to use. I think the branch can be merged and this closed, since we can now convert all the submitted HN files to PDF without using libreaderex_x64.so. @JeziL , @lelandyang let me know if you feel strongly about not merging, etc.
I have also updated the readme in the dev branch, so when it is merged the readme will carry some HN info.
Added libjbig2dec-based JBIG2 decoding as well, besides the libpoppler one; and updated the README and docs elsewhere.
Here is the converted result of that strange old HN file: Windows9x_NT操作系统的磁盘备份与恢复的研究与实现_张宗伟.pdf . Remaining issues are tracked in #57 . I just checked that Fedora (which I use) ships a MinGW cross-compiled libpoppler, so I can actually build the JBIG2 modules as a DLL for Windows too. I still don't use Python on Windows, but I can commit the DLLs into the repo, and one of you can edit the Python code slightly to look for the Windows DLL under Windows instead of the shared library for Linux.
I did build the libpoppler-based JBIG2 DLLs, but I don't like the result: it brings in gcc's C++ runtime DLL (the equivalent of Microsoft's msvcr1xx / msvcp1xx) and a dozen other libraries, totalling about 60MB; so I switched to libjbig2dec for building the JBIG2 DLL. libpoppler is more common on Linux (every PDF reader on Linux depends on it, except Adobe's and ghostscript/mupdf), while only ghostscript and mupdf use libjbig2dec, which provides just JBIG2 decoding. Tested against Windows Python 3.5.x under Wine, and updated the Windows DLL loading too, so it works on Windows without needing to build shared libraries / DLLs. Last chance! If I don't hear from you guys @JeziL @lelandyang , I am merging HN-dev onto master in a couple of days. " ... convert ..." works on Windows as-is (I built and bundled the DLLs), while for Linux/Mac instructions are provided on how to build the two shared libraries (and it is quite obvious when they are missing and fail to load). Follow-up issues are at #57 .
It appears that the HN-format image data consists of a DIB header (as in the BMP format), a palette, and compressed pixel data.
DIB header format:
Palette entry format:
Sample:
Comparing against the DIB header format, you can see that in the decompiled source of libreaderex.so,
CImage::DecodeJbig(int a1, int a2, int a3)
parses the DIB header and passes the relevant parameters on to the JBigCodec::Decode
function. How the compressed pixel data is decoded remains to be studied.