# azw3-or-mobi-to-cbz

Please read the [README](README.md) file for an introduction.

This notebook is split into a few sections:

* **Library imports and utility functions**
    * If you have trouble with the dependencies, take a look here.
    * In normal cases, you don't need to look here, because `pip install -r requirements.txt` is usually enough.
* **MOBI-to-CBZ parsing and conversion code**
    * The main logic of this code. It extracts data from the e-book, does a bunch of sanity checks, and writes a CBZ file.
    * In normal cases, you don't need to look here.
    * If you want to understand what this code does, or debug why it is not working with your e-book, this is the place to look.
* **Running it against your files**
    * This is where you have to edit this notebook, to adapt to your files.

## Library imports and utility functions

In [None]:
# https://beautiful-soup-4.readthedocs.io/
from bs4 import BeautifulSoup

In [None]:
# https://github.com/iscc/mobi/blob/master/README.md
import mobi

In [None]:
# https://lxml.de/tutorial.html#the-e-factory
from lxml import etree
from lxml.builder import ElementMaker

In [None]:
# https://tqdm.github.io/
from tqdm.notebook import tqdm

In [None]:
import json
import re
import shutil
from collections import Counter, defaultdict, namedtuple
from dataclasses import asdict, dataclass, field
from datetime import date
from glob import glob
from itertools import chain
from pathlib import Path
from zipfile import ZipFile

In [None]:
# https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists
# Returns a list filtering out any duplicate items.
# If you don't care about the order, don't use this function. Instead, just use the built-in set() type.
def uniq(sequence):
    return list(dict.fromkeys(sequence))
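
A minimal sketch of why `uniq()` is used instead of `set()` when the page order matters. The filenames below are made up for illustration.

```python
def uniq(sequence):
    # dict keys keep their insertion order, so the first occurrence wins.
    return list(dict.fromkeys(sequence))

# Hypothetical image names, in reading order, with one repeated image.
pages = ['image00002.jpeg', 'image00001.jpeg', 'image00002.jpeg', 'image00003.jpeg']
deduped = uniq(pages)
print(deduped)

# set() removes the same duplicates, but makes no promise about the order.
assert set(deduped) == set(pages)
```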

## MOBI-to-CBZ parsing and conversion code

In [None]:
# https://github.com/anansi-project/comicinfo
# This function reads the `…/mobi8/OEBPS/content.opf` metadata file
# and returns a nice `comicinfo.xml` output.
# It also returns whether this book is fixed layout (pre-paginated).
# It also returns the cover image filename.
def comicinfo_converter(content_opf_filename: Path):
    with open(content_opf_filename) as f:
        content_opf = f.read().replace('<!-- BEGIN INFORMATION ONLY', '').replace('END INFORMATION ONLY -->', '')
    doc = BeautifulSoup(content_opf, 'xml')
    metadata = doc.metadata

    def metatext(*args, **kwargs):
        if el := metadata.find(*args, **kwargs):
            if el.name == 'meta' and 'content' in el.attrs:
                # Used for `<meta name="…" content="…" />` tags
                return el['content']
            else:
                # Used for `<meta property="…">…</meta>` tags
                # Used for `<dc:foobar>…</dc:foobar>` tags
                return el.string.strip()
        else:
            return None

    datestr = metatext('dc:date')
    dateobj = None
    if datestr:
        try:
            # Keep the parsed date; without this assignment Year/Month/Day are never emitted.
            dateobj = date.fromisoformat(datestr)
        except ValueError:
            pass

    fixed_layout = metatext('meta', attrs={'name': 'fixed-layout'})
    rendition_layout = metatext('meta', property='rendition:layout')

    E = ElementMaker()
    comicinfo = E.ComicInfo(
        # Either this title, or the one from <meta name="Updated_Title" content="…" />
        E.Title(metatext('dc:title') or ''),
        E.Summary(metatext('dc:description') or ''),
        *(
            [
                E.Year(str(dateobj.year)),
                E.Month(str(dateobj.month)),
                E.Day(str(dateobj.day)),
            ] if dateobj else []
        ),
        E.Publisher(metatext('dc:publisher') or ''),
        E.LanguageISO(metatext('dc:language') or ''),

        # Mapping "Creator" to "Writer", and hoping it's a good enough match.
        E.Writer(metatext('dc:creator') or ''),
        # E.Penciller(),
        # E.Inker(),
        # E.Colorist(),
        # E.Letterer(),
        # E.CoverArtist(),
        # E.Editor(),

        # E.Series(),
        # E.Number(),
        # E.Count(),
        # E.Volume(),
        # E.AlternateSeries(),
        # E.AlternateNumber(),
        # E.AlternateCount(),
        # E.Notes(),
        # E.Imprint(),
        # E.Genre(),
        # E.Web(),
        # E.PageCount(),
        # E.Format(),
        # E.BlackAndWhite(),
        # E.Manga(),
        # E.Characters(),
        # E.Teams(),
        # E.Locations(),
        # E.ScanInformation(),
        # E.StoryArc(),
        # E.SeriesGroup(),
        # E.AgeRating(),
        # E.MainCharacterOrTeam(),
        # E.Review(),
    )

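    guide = doc.guide
    # There is also type="other.ms-coverimage", which is ignored here.
    # There is also type="other.ms-coverimage-standard", which is ignored here.
    # A book may have no <guide> element at all, so guard against None.
    ref = guide.find('reference', type='cover') if guide is not None else None
    # `ref.href` would look up a child tag named <href>; the attribute is read with `.get()`.
    cover = ref.get('href') if ref is not None else None
    # Even though we try to extract the cover here, it ends up unused in the rest of the code.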

    return (
        etree.tostring(comicinfo, xml_declaration=True, encoding='utf-8'),
        fixed_layout,
        rendition_layout,
        cover,
    )
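
The `dc:date` handling above only accepts values that `date.fromisoformat` understands. Below is a minimal sketch of a more tolerant parser, assuming (this has not been checked against real books) that some OPF files store a full timestamp rather than a plain date. `parse_opf_date` is a hypothetical helper, not part of this notebook.

```python
from datetime import date

def parse_opf_date(datestr):
    """Hypothetical helper: return a date, or None when the value cannot be parsed."""
    if not datestr:
        return None
    try:
        # Keep only the leading YYYY-MM-DD part, dropping any time or timezone suffix.
        return date.fromisoformat(datestr.strip()[:10])
    except ValueError:
        return None

print(parse_opf_date('2015-03-10T00:00:00+00:00'))
print(parse_opf_date('unknown'))
```

Returning `None` keeps the existing behaviour: the `Year`, `Month` and `Day` elements are simply left out when no date is available.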

In [None]:
def extract_urls_from_css(csspath: Path):
    try:
        text = csspath.read_text()
        text = re.sub(r'@font-face\s*{[^}]*}', '', text)
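        # The capture group excludes quotes, so `url("foo.jpg")` yields `foo.jpg` and not `foo.jpg"`.
        matches = re.findall(r'''url\(\s*['"]?([^'")]+?)['"]?\s*\)''', text)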
        return [csspath.parent / url for url in matches if url != 'XXXXXXXXXXXXXXXX']
    except FileNotFoundError as e:
        if csspath.name == 'XXXXXXXXXXXXXXXX':
            # Ignore bogus href.
            return []
        else:
            raise
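
To see what the two regular expressions do, here is a small sketch that runs the same idea against an in-memory stylesheet. The CSS sample and the `css_urls` name are made up for illustration, and the pattern is a quote-aware variant written for this sketch.

```python
import re

# Hypothetical stylesheet, loosely modelled on a fixed-layout e-book.
SAMPLE_CSS = '''
@font-face { font-family: "Foo"; src: url(../Fonts/foo.ttf); }
.leftPage { background-image: url("../Images/image00001.jpeg"); }
.rightPage { background-image: url('../Images/image00002.jpeg'); }
.other { background: url(../Images/image00003.png) no-repeat; }
'''

def css_urls(text):
    # Drop @font-face blocks first, so font files are not mistaken for page images.
    text = re.sub(r'@font-face\s*{[^}]*}', '', text)
    # Capture each URL without its optional surrounding quotes.
    return re.findall(r'''url\(\s*['"]?([^'")]+?)['"]?\s*\)''', text)

print(css_urls(SAMPLE_CSS))
```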

In [None]:
def extract_urls_from_svg(svg_element, basedir: Path):
    for image in svg_element.find_all('image'):
        # `.get()` returns None instead of raising KeyError when the attribute is missing.
        href = image.get('xlink:href')
        if href:
            yield basedir / href

In [None]:
def parse_single_page_from_extracted_mobi(fname, basedir):
    images = []
    messages = []

    with open(fname) as f:
        doc = BeautifulSoup(f, 'html.parser')

    # Capturing images from this page:
    # Images linked from the CSS stylesheet.
    # `.get()` tolerates <link> tags without an href; the `if href` filter below drops them.
    uniq_css_href = uniq(link_element.get('href') for link_element in doc.find_all('link', rel='stylesheet'))
    from_css = uniq(chain.from_iterable(extract_urls_from_css(fname.parent / href) for href in uniq_css_href if href))
    # if from_css:
    #     pass
    #     # print(bookpath.name, fname.relative_to(basedir), 'link stylesheets pointing to:', from_css)
    # Images linked from an inline SVG element. Usually for the book cover.
    from_svg = uniq(chain.from_iterable(extract_urls_from_svg(svg, fname.parent) for svg in doc.body.find_all('svg')))
    # if from_svg:
    #     pass
    #     # print(bookpath.name, fname.relative_to(basedir), 'svg pointing to:', from_svg)
    # Images included via the good old `<img src="…" />` tag.
    from_img = uniq(fname.parent / img['src'] for img in doc.body.find_all('img'))
    # if from_img:
    #     pass
    #     # print(bookpath.name, fname.relative_to(basedir), 'img pointing to:', from_img)
    images = uniq(chain(from_css, from_svg, from_img))
    if len(images) == 2:
        # Doing some sanity checks.
        left = doc.select('.leftPage#page-img-left')
        right = doc.select('.rightPage#page-img-right')
        idGen = doc.select('div[id^="_idContainer"] > img._idGenObjectAttribute-1._idGenObjectAttribute-2')
        if len(left) == len(right) == 1:
            # Easy and straightforward markup.
            pass
        elif len(idGen) == 2:
            # This is a more convoluted markup, but also easy enough.
            pass
        else:
            # This is a different markup than everything else.
            messages.append(
                'WARNING page {} has two images in a single page, but using some unique markup.\n{}'.format(
                    fname.relative_to(basedir),
                    '\n'.join(str(p) for p in images),
                )
            )

        # This XHTML document holds two pages.
        # Let's just assume the images are listed in the correct order in the CSS.
        # For the few books I manually inspected, this assumption is correct.
        # This logic will fail if the CSS lists the right page before the left one.
        if images != sorted(images):
            messages.append(
                'WARNING page {} has two images in this page, but in reverse order.\n{}'.format(
                    fname.relative_to(basedir),
                    '\n'.join(str(p) for p in images),
                )
            )
    elif len(images) > 2:
        messages.append(
            'ERROR too many images in a single page: {} has {} images.'.format(
                fname.relative_to(basedir),
                len(images),
            )
        )

    # Checking for text inside the page.
    body = doc.body
    text = ' '.join(body.stripped_strings)
    if text == "":
        if len(images) == 0:
            messages.append(
                'WARNING page {} has no text and no images.'.format(fname.relative_to(basedir))
            )
    else:
        # Some books have "invisible" text overlaid on top of the image.
        # Those are usually clickable links.
        messages.append('TEXT {} {}'.format(fname.relative_to(basedir), text))

    return images, messages, len(text)

In [None]:
@dataclass
class BookResults:
    path: str
    cbzpath: str = ''
    converted: bool = False
    page_count: int = 0
    image_count: int = 0
    char_count: int = 0
    max_chars_per_page: int = 0
    errors: int = 0
    warnings: int = 0
    messages: list[str] = field(default_factory=list)

    # Why have this method instead of making `.errors` and `.warnings` getters?
    # Because only dataclass fields are converted to key/value pairs by the `asdict()` function.
    def count_messages(self):
        self.errors = 0
        self.warnings = 0
        for m in self.messages:
            if m.startswith('ERR'):
                self.errors += 1
            elif m.startswith('WARN'):
                self.warnings += 1
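
The comment in `BookResults` about `asdict()` can be checked with a small experiment. `MiniResults` is a hypothetical, trimmed-down stand-in written for this sketch: it stores `errors` as a field and exposes `warnings` as a property, to show that only the field survives the conversion.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class MiniResults:
    # Hypothetical stand-in for BookResults, reduced to what the demo needs.
    messages: list = field(default_factory=list)
    errors: int = 0

    @property
    def warnings(self):
        # A property is computed on access and is not a dataclass field.
        return sum(1 for m in self.messages if m.startswith('WARN'))

    def count_messages(self):
        self.errors = sum(1 for m in self.messages if m.startswith('ERR'))

r = MiniResults(messages=['ERROR a', 'WARNING b', 'WARNING c', 'TEXT d'])
r.count_messages()
d = asdict(r)
print(d)           # contains 'messages' and 'errors' only
print(r.warnings)  # still available on the object, but absent from the dict
```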

In [None]:
def convert_mobi_to_cbz(
    bookpath: Path,
    *,
    keep_tempdir_because_i_am_debugging: bool = False,
    exclude_images: set = None,
    exclude_duplicates: bool = True,
) -> BookResults:
    '''Given a MOBI or AZW3 filepath, writes a CBZ file and returns a BookResults instance.

    Parameters:
    bookpath - Must be a Path instance. If you have a string, please convert it to Path before passing to this function.
    keep_tempdir_because_i_am_debugging - Boolean. Enable it to keep the temporary directory (instead of automatically deleting it).
    exclude_images - Set. These image names will be excluded from the CBZ.
    exclude_duplicates - Boolean. Don't add duplicate images to the CBZ.
    '''
    ret = BookResults(path=str(bookpath))
    try:
        tempdir, filepath = mobi.extract(str(bookpath))
    except Exception as e:
        ret.messages.append('ERROR cannot extract the book: {}'.format(e))
        ret.count_messages()
        return ret
        
    tempdir = Path(tempdir)
    # Verbose progress output:
    print('Processing {!r} using temp directory {!r}'.format(bookpath, tempdir))

    try:
        comicinfo_xml, fixed_layout, rendition_layout, cover = comicinfo_converter(tempdir / 'mobi8/OEBPS/content.opf')
        if fixed_layout is None and rendition_layout is None:
            # Non-fixed layout ebook. This isn't a comic book. Nothing to do here.
            ret.messages.append('SKIPPED non-fixed layout, this is not a comic book.')
            ret.count_messages()
            return ret
        elif fixed_layout == 'false' and rendition_layout == 'reflowable':
            # Non-fixed layout ebook. This isn't a comic book. Nothing to do here.
            ret.messages.append('SKIPPED non-fixed layout, this is not a comic book.')
            ret.count_messages()
            return ret
        elif fixed_layout == 'true' and rendition_layout == 'pre-paginated':
            pass
        else:
            # Sanity check.
            ret.messages.append('ERROR unsupported values for fixed_layout={!r} and rendition_layout={!r}.'.format(fixed_layout, rendition_layout))
            ret.count_messages()
            return ret

        all_images = []

        # The filenames are:
        # * cover_page.xhtml
        # * nav.xhtml
        # * part0000.xhtml … part9999.xhtml
        for fname in sorted(tempdir.glob('**/*.xhtml')):
            if fname.name == 'nav.xhtml':
                continue

            page_images, page_messages, page_chars = parse_single_page_from_extracted_mobi(fname, tempdir)

            all_images.extend(page_images)
            ret.page_count += 1
            ret.messages.extend(page_messages)
            ret.char_count += page_chars
            ret.max_chars_per_page = max(ret.max_chars_per_page, page_chars)

        # Removing files that cannot be found.
        # Likely due to broken CSS. (e.g. non-existing background image)
        # Also removing images explicitly excluded by the caller.
        # Also resolving the paths, removing the `..` parts. This is needed in order to detect duplicates.
        tmp = []
        for img in all_images:
            if exclude_images and img.name in exclude_images:
                pass
            elif not img.is_file():
                ret.messages.append('WARNING image not found {}'.format(img))
            else:
                tmp.append(img.resolve())
        all_images = tmp

        ret.image_count = len(all_images)
        if ret.image_count == 0:
            ret.messages.append('SKIPPED no images found in this book.')
            ret.count_messages()
            return ret

        # Checking for duplicate images across multiple pages.
        if len(all_images) != len(set(all_images)):
            cnt = Counter(all_images)
            if exclude_duplicates:
                ret.messages.append('WARNING duplicate images across pages.')
                all_images = uniq(all_images)
            else:
                ret.messages.append('ERROR duplicate images across pages.')
            ret.messages.extend(['DUPLICATE {}x {}'.format(v, k) for k, v in cnt.items() if v > 1])

        if all_images != sorted(all_images):
            ret.messages.append('WARNING image filenames are not in alphabetical order')

        # All sanity checks passed.

        ret.cbzpath = str(bookpath.with_suffix('.cbz'))
        with ZipFile(ret.cbzpath, 'w') as cbz:
            # The ComicInfo convention names this file `ComicInfo.xml`; some readers may be case-sensitive.
            cbz.writestr('ComicInfo.xml', comicinfo_xml)
            for i, img in enumerate(all_images):
                ext = img.suffix.replace('jpeg', 'jpg')
                cbz.write(img, 'page{:03}{}'.format(i + 1, ext))

        ret.converted = True
        ret.count_messages()
        return ret

    finally:
        if not keep_tempdir_because_i_am_debugging:
            shutil.rmtree(tempdir)
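
The last step of the conversion renames every image to a sequential page name inside the archive. Here is a minimal sketch of that naming scheme, writing to an in-memory ZIP instead of a real `.cbz` file. The image names and byte contents are placeholders, and `page_name` is a hypothetical helper that mirrors the format string used above.

```python
import io
from pathlib import PurePosixPath
from zipfile import ZipFile

def page_name(index, suffix):
    # Pages are numbered from 1, zero-padded to three digits; `.jpeg` becomes `.jpg`.
    return 'page{:03}{}'.format(index + 1, suffix.replace('jpeg', 'jpg'))

# Placeholder images: (original filename, fake content).
images = [('image00007.jpeg', b'fake-jpeg'), ('image00009.png', b'fake-png')]

buffer = io.BytesIO()
with ZipFile(buffer, 'w') as cbz:
    for i, (name, data) in enumerate(images):
        cbz.writestr(page_name(i, PurePosixPath(name).suffix), data)

with ZipFile(buffer) as check:
    names = check.namelist()
print(names)
```

Note that the three-digit padding only keeps the names in sorted order up to 999 pages.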

## Running it against your files

You have several ways to run this code.

Let's start with some examples of how to run it on a single file. Let's say that file is called `BOOK`:

In [None]:
BOOK = Path("/home/foobar/Books/Example Comic Book.azw3")

Note: Please ignore this message:

> Warning: Bad key, size, value combination detected in EXTH  406 16 0000000000000000

This warning is coming from the `mobi` library, and it is harmless. Unfortunately, there is no option to stop printing this warning.

You can convert it to CBZ using the default options:

In [None]:
out = convert_mobi_to_cbz(BOOK)
print(json.dumps(asdict(out), indent=2))

For most cases, you probably want to skip duplicate images.

(The code takes shortcuts: it collects every image referenced by a page's markup and stylesheets, so the same image can be detected on multiple pages even if it isn't actually shown on them. Learn more about how the code works by reading the README and the code itself.)

Still, in some cases you may want to keep duplicate images for your specific book, and you can do that:

In [None]:
out = convert_mobi_to_cbz(BOOK, exclude_duplicates=False)
print(json.dumps(asdict(out), indent=2))

In some cases, there may be some decorative images that are being incorrectly added to the CBZ. You can easily exclude them, in a case-by-case manner:

In [None]:
out = convert_mobi_to_cbz(BOOK, exclude_images={'image00015.gif', 'image00023.jpeg'})
print(json.dumps(asdict(out), indent=2))

Sometimes you need to debug what is going on. That's also easy to do. Just remember to manually delete the temporary directory after you're finished:

In [None]:
out = convert_mobi_to_cbz(BOOK, keep_tempdir_because_i_am_debugging=True)
print(json.dumps(asdict(out), indent=2))

Finally, you may have a directory full of e-books. You may want to convert them all to CBZ. Well, not all of them, just those that have a fixed layout. And you may also want to save a statistics/diagnostics log as a JSON file for later inspection.

This is also easy to do:

In [None]:
BASEDIR = Path('/home/foobar/Books/')
with open(BASEDIR / '_MOBI_TO_CBZ_LOG.json', 'w') as logfile:
    json.dump([
        asdict(convert_mobi_to_cbz(book))
        for book in tqdm(sorted(BASEDIR.glob('*.azw*')))
    ], logfile, indent=2)
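
For the later inspection, a short script can summarize the log. The sketch below works on a made-up sample with the same keys as `BookResults`. With a real log, you would load the entries with `json.load()` from the file written above. `summarize` and `needs_attention` are hypothetical helpers.

```python
import json
from collections import Counter

# Made-up sample, shaped like the output of asdict(BookResults).
SAMPLE_LOG = json.loads('''[
  {"path": "a.azw3", "converted": true, "errors": 0, "warnings": 0, "messages": []},
  {"path": "b.azw3", "converted": true, "errors": 0, "warnings": 1,
   "messages": ["WARNING duplicate images across pages."]},
  {"path": "c.azw3", "converted": false, "errors": 0, "warnings": 0,
   "messages": ["SKIPPED non-fixed layout, this is not a comic book."]},
  {"path": "d.azw3", "converted": false, "errors": 1, "warnings": 0,
   "messages": ["ERROR cannot extract the book: example"]}
]''')

def summarize(entries):
    # Count how many books were converted, skipped on purpose, or failed.
    summary = Counter()
    for entry in entries:
        if entry['converted']:
            summary['converted'] += 1
        elif any(m.startswith('SKIPPED') for m in entry['messages']):
            summary['skipped'] += 1
        else:
            summary['failed'] += 1
    return dict(summary)

def needs_attention(entries):
    # Books with at least one error or warning deserve a manual look.
    return [e['path'] for e in entries if e['errors'] or e['warnings']]

print(summarize(SAMPLE_LOG))
print(needs_attention(SAMPLE_LOG))
```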