MIME checking was added

artiomn · Oct 5, 2022 · 3ce48dd
1 parent 0d5b56d
Showing 7 changed files with 97 additions and 56 deletions.
diff --git a/README.md b/README.md
@@ -12,8 +12,8 @@ The Markdown Articles Tool is available for macOS, Windows, and Linux.
 
 Tool can be used:
 
-- To download markdown text with images with images and:
-  * Find all links to images, download images and fix links in the document.
+- To download Markdown documents with images and:
+  * Find all image links, download images and fix links in the document.
   * Can skip broken links.
   * Deduplicate similar images by content hash or using hash as a name.
 - Support images, linked with HTML `<img>` tag.
@@ -28,13 +28,19 @@ Also, if you want to use separate functions, you can just import the package.
 
 ## Changes
 
+### 0.0.8
+
 `-D` (deduplication) option was changed in the version 0.0.8. Now option is not boolean, it has several values: "disabled", "names_hashing", "content_hash".
   Long option name was changed too: now it's `deduplication-type`.
 
+### 0.1.2
 
-## Possibly bugs
-
-Deduplication can replace not similar images. Probability of this is very low, but it's possible. Will be fixed in the next version.
+- `-l, --process-local-images` is deprecated since version 0.1.2 and has no effect: local images are always processed.
+- Images with an unrecognized MIME type are not downloaded by default (use `-E` to disable this behaviour).
+- New option `-P, --prepend-images-with-path` changes the image output path structure. If this option is enabled,
+  the "remote" image path is reproduced in the local directory structure.
+- Code was significantly refactored.
+- Some automated tests were added.
 
 
 ## Installation
@@ -61,38 +67,38 @@ pip3 install markdown-tool
 Syntax:
 
 ```
-markdown_tool [-h] [-D {disabled,names_hashing,content_hash}] [-d IMAGES_DIRNAME] [-a] [-s SKIP_LIST]
-              [-i {md,html,md+html,html+md}] [-l] [-n] [-o {md,html}] [-p IMAGES_PUBLIC_PATH] [-P] [-R]
-              [-t DOWNLOADING_TIMEOUT] [-O OUTPUT_PATH] [--verbose] [--version] article_file_path_or_url
+markdown_tool [options] <article_file_path_or_url>
 
 options:
   -h, --help            show this help message and exit
   -D {disabled,names_hashing,content_hash}, --deduplication-type {disabled,names_hashing,content_hash}
-                        Deduplicate images, using content hash or SHA1(image_name)
+                        Deduplicate images, using content hash or SHA1(image_name) (default: disabled)
   -d IMAGES_DIRNAME, --images-dirname IMAGES_DIRNAME
-                        Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url)
+                        Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url) (default: images)
   -a, --skip-all-incorrect
-                        skip all incorrect images
+                        skip all incorrect images (default: False)
+  -E, --download-incorrect-mime
+                        download "images" with unrecognized MIME type (default: False)
   -s SKIP_LIST, --skip-list SKIP_LIST
-                        skip URL's from the comma-separated list (or file with a leading '@')
+                        skip URL's from the comma-separated list (or file with a leading '@') (default: None)
   -i {md,html,md+html,html+md}, --input-format {md,html,md+html,html+md}
-                        input format
+                        input format (default: md)
   -l, --process-local-images
-                        [DEPRECATED] Process local images
+                        [DEPRECATED] Process local images (default: False)
   -n, --replace-image-names
-                        Replace image names, using content hash
+                        Replace image names, using content hash (default: False)
   -o {md,html}, --output-format {md,html}
-                        output format
+                        output format (default: md)
   -p IMAGES_PUBLIC_PATH, --images-public-path IMAGES_PUBLIC_PATH
                         Public path to the folder of downloaded images (possible variables: $article_name, $time, $date, $dt, $base_url)
   -P, --prepend-images-with-path
-                        Save relative images paths
-  -R, --remove-source   Remove or replace source file
+                        Save relative images paths (default: False)
+  -R, --remove-source   Remove or replace source file (default: False)
   -t DOWNLOADING_TIMEOUT, --downloading-timeout DOWNLOADING_TIMEOUT
-                        how many seconds to wait before downloading will be failed
+                        how many seconds to wait before downloading will be failed (default: -1)
   -O OUTPUT_PATH, --output-path OUTPUT_PATH
-                        article output file name
-  --verbose, -v         More verbose logging
+                        article output file name or path
+  --verbose, -v         More verbose logging (default: False)
   --version             return version number
 ```
 
@@ -119,10 +125,10 @@ find content/ -name "*.md" | xargs -n1 ./markdown_tool.py
 
 Tools is a pipeline, which get Markdown form the source and process them, using blocks:
 
-- Source to download article.
-- `ImageDownloader` to download every image.
-  Inside may be used image deduplicators blocks applied to the image.
+- Source downloads the article.
+- `ImageDownloader` downloads every image.
+  Image deduplicator blocks may be applied to each image inside it.
 - Transform article file, i.e. fix images URLs.
 - Format article to the specific format (Markdown, HTML, PDF, etc.), using selected formatters.
 
-`ArticleProcessor` clas is a strategy, applies blocks, based on the parameters (from the CLI, for example).
+`ArticleProcessor` class is a strategy that applies blocks based on the parameters (from the CLI, for example).
diff --git a/markdown_tool.py b/markdown_tool.py
@@ -60,6 +60,7 @@ def main(arguments):
                                  images_public_path=getattr(arguments, 'images_public_path', ''),
                                  input_formats=arguments.input_format.split('+'),
                                  skip_all_incorrect=arguments.skip_all_incorrect,
+                                 download_incorrect_mime=arguments.download_incorrect_mime,
                                  deduplication_type=getattr(DeduplicationVariant, arguments.deduplication_type.upper()),
                                  images_dirname=arguments.images_dirname)
 
@@ -72,7 +73,7 @@ def main(arguments):
 
     parser = argparse.ArgumentParser(
         prog='markdown_tool',
-        epilog='Use this at your own risk',
+        epilog='Use this tool at your own risk!',
         description=f'{__doc__}Version: {__version__}',
         formatter_class=CustomArgumentDefaultsHelpFormatter
     )
@@ -85,6 +86,8 @@ def main(arguments):
                              '(possible variables: $article_name, $time, $date, $dt, $base_url)')
     parser.add_argument('-a', '--skip-all-incorrect', default=False, action='store_true',
                         help='skip all incorrect images')
+    parser.add_argument('-E', '--download-incorrect-mime', default=False, action='store_true',
+                        help='download "images" with unrecognized MIME type')
     parser.add_argument('-s', '--skip-list', default=None,
                         help='skip URL\'s from the comma-separated list (or file with a leading \'@\')')
     parser.add_argument('-i', '--input-format', default='md', choices=IN_FORMATS_LIST,
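
The new `-E` flag is a plain `store_true` switch, so it is off unless passed explicitly. A minimal standalone sketch of that behaviour, reusing only the `add_argument` call from the hunk above (the parser here is a stripped-down stand-in, not the tool's full parser):

```python
import argparse

# Stand-in parser: only the option added in this commit is registered.
parser = argparse.ArgumentParser(prog='markdown_tool')
parser.add_argument('-E', '--download-incorrect-mime', default=False, action='store_true',
                    help='download "images" with unrecognized MIME type')

# Without the flag the attribute stays False; with it, it becomes True.
default_args = parser.parse_args([])
enabled_args = parser.parse_args(['-E'])

print(default_args.download_incorrect_mime, enabled_args.download_incorrect_mime)
```

argparse derives the attribute name `download_incorrect_mime` from the long option, which is the name `main()` reads when it builds `ArticleProcessor`.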

diff --git a/markdown_toolset/article_downloader.py b/markdown_toolset/article_downloader.py
@@ -2,7 +2,7 @@
 from pathlib import Path
 from time import strftime
 
-from .www_tools import is_url, download_from_url, get_filename_from_url, get_base_url, remove_protocol_prefix
+from .www_tools import is_url, download_from_url, get_filename_from_url, get_base_url
 
 
 class ArticleDownloader:
@@ -16,6 +16,7 @@ def __init__(self, article_url, output_path, article_formatter, downloading_time
         self._article_formatter = article_formatter
         self._downloading_timeout = downloading_timeout
         self._remove_source = remove_source
+        # TODO: Merge `article_path` and `article_out_path`.
         self._article_path = None
         self._article_out_path = None
 

diff --git a/markdown_toolset/article_processor.py b/markdown_toolset/article_processor.py
@@ -1,14 +1,12 @@
 import logging
-from enum import Enum
 from itertools import permutations
 from pathlib import Path
 from string import Template
 from time import strftime
 from typing import Union, List
 
 from .article_downloader import ArticleDownloader
-from .deduplicators.content_hash_dedup import ContentHashDeduplicator
-from .deduplicators.name_hash_dedup import NameHashDeduplicator
+from .deduplicators import DeduplicationVariant, select_deduplicator
 from .out_path_maker import OutPathMaker
 from .www_tools import remove_protocol_prefix
 from .image_downloader import ImageDownloader
@@ -20,18 +18,13 @@
 OUT_FORMATS_LIST = [f.format for f in FORMATTERS if f is not None]
 
 
-class DeduplicationVariant(Enum):
-    DISABLED = 0,
-    NAMES_HASHING = 1,
-    CONTENT_HASH = 2
-
-
 class ArticleProcessor:
     def __init__(self, article_file_path_or_url: str,
                  skip_list: Union[str, List[str]] = '', downloading_timeout: int = -1,
                  output_format: str = OUT_FORMATS_LIST[0], output_path: Union[Path, str] = Path.cwd(),
                  remove_source: bool = False, images_public_path: Union[Path, str] = '',
                  input_formats: List[str] = tuple(IN_FORMATS_LIST), skip_all_incorrect: bool = False,
+                 download_incorrect_mime: bool = False,
                  deduplication_type: DeduplicationVariant = DeduplicationVariant.DISABLED,
                  images_dirname: Union[Path, str] = 'images'):
         self._article_formatter = get_formatter(output_format, FORMATTERS)
@@ -44,11 +37,12 @@ def __init__(self, article_file_path_or_url: str,
         self._images_public_path = images_public_path
         self._input_formats = input_formats
         self._skip_all_incorrect = skip_all_incorrect
+        self._download_incorrect_mime = download_incorrect_mime
         self._deduplication_type = deduplication_type
         self._images_dirname = images_dirname
 
     def process(self):
-        skip_list = self._process_skip_list()
+        skip_list = self._process_skip_list_file()
         article_path, article_base_url, article_out_path = self._article_downloader.get_article()
 
         logging.info('File "%s" will be processed...', article_path)
@@ -67,14 +61,10 @@ def process(self):
         image_dir_name = Path(Template(self._images_dirname).safe_substitute(**variables))
         image_public_path = None if not image_public_path else Path(image_public_path)
 
-        deduplicator = None
-
-        if DeduplicationVariant.CONTENT_HASH == self._deduplication_type:
-            deduplicator = ContentHashDeduplicator(image_dir_name, image_public_path)
-        elif DeduplicationVariant.NAMES_HASHING == self._deduplication_type:
-            deduplicator = NameHashDeduplicator()
-        elif DeduplicationVariant.DISABLED == self._deduplication_type:
-            pass
+        if self._deduplication_type == DeduplicationVariant.CONTENT_HASH:
+            deduplicator = select_deduplicator(self._deduplication_type, image_dir_name, image_public_path)
+        else:
+            deduplicator = select_deduplicator(self._deduplication_type)
 
         out_path_maker = OutPathMaker(
             article_file_path=article_path,
@@ -87,14 +77,15 @@ def process(self):
             out_path_maker=out_path_maker,
             skip_list=skip_list,
             skip_all_errors=self._skip_all_incorrect,
+            download_incorrect_mime_types=self._download_incorrect_mime,
             downloading_timeout=self._downloading_timeout,
             deduplicator=deduplicator
         )
 
         result = transform_article(article_path, self._input_formats, TRANSFORMERS, img_downloader)
         format_article(article_out_path, result, self._article_formatter)
 
-    def _process_skip_list(self):
+    def _process_skip_list_file(self):
         skip_list = self._skip_list
 
         if isinstance(skip_list, str):

diff --git a/markdown_toolset/deduplicators/__init__.py b/markdown_toolset/deduplicators/__init__.py
@@ -0,0 +1,26 @@
+from enum import Enum
+
+from markdown_toolset.deduplicators.content_hash_dedup import ContentHashDeduplicator
+from markdown_toolset.deduplicators.name_hash_dedup import NameHashDeduplicator
+
+
+class DeduplicationVariant(Enum):
+    DISABLED = 0
+    NAMES_HASHING = 1
+    CONTENT_HASH = 2
+
+
+DEDUP_MAP = {
+    DeduplicationVariant.CONTENT_HASH: ContentHashDeduplicator,
+    DeduplicationVariant.NAMES_HASHING: NameHashDeduplicator,
+    DeduplicationVariant.DISABLED: None,
+}
+
+
+def select_deduplicator(deduplication_variant: DeduplicationVariant, *args, **kwargs):
+    dedup_class = DEDUP_MAP[deduplication_variant]
+
+    return dedup_class(*args, **kwargs) if dedup_class is not None else None
+
+
+__all__ = ['DeduplicationVariant', 'select_deduplicator']
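
The map-based selection in this new module can be tried in isolation. The sketch below is self-contained and uses hypothetical stand-in classes in place of the real deduplicators (their constructor signatures are assumptions taken from how `article_processor.py` calls them); the enum uses plain integer values.

```python
from enum import Enum


class DeduplicationVariant(Enum):
    DISABLED = 0
    NAMES_HASHING = 1
    CONTENT_HASH = 2


class NameHashDeduplicator:
    """Stand-in for the real class; takes no constructor arguments."""


class ContentHashDeduplicator:
    """Stand-in for the real class; stores the two paths it receives."""

    def __init__(self, img_dir, img_public_path):
        self.img_dir = img_dir
        self.img_public_path = img_public_path


DEDUP_MAP = {
    DeduplicationVariant.CONTENT_HASH: ContentHashDeduplicator,
    DeduplicationVariant.NAMES_HASHING: NameHashDeduplicator,
    DeduplicationVariant.DISABLED: None,
}


def select_deduplicator(variant, *args, **kwargs):
    # Look up the class and instantiate it; DISABLED maps to no deduplicator.
    dedup_class = DEDUP_MAP[variant]
    return dedup_class(*args, **kwargs) if dedup_class is not None else None


disabled = select_deduplicator(DeduplicationVariant.DISABLED)
by_name = select_deduplicator(DeduplicationVariant.NAMES_HASHING)
by_content = select_deduplicator(DeduplicationVariant.CONTENT_HASH, 'images', None)
```

Because extra arguments are forwarded unchanged, the caller still has to know which variant needs which arguments, which is why `process()` keeps a branch for `CONTENT_HASH`.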
diff --git a/markdown_toolset/image_downloader.py b/markdown_toolset/image_downloader.py
@@ -1,5 +1,6 @@
 import logging
 import hashlib
+import mimetypes
 from pathlib import Path
 from typing import Optional, List
 
@@ -17,26 +18,24 @@ def __init__(self,
                  out_path_maker: OutPathMaker,
                  skip_list: Optional[List[str]] = None,
                  skip_all_errors: bool = False,
+                 download_incorrect_mime_types: bool = False,
                  downloading_timeout: float = -1,
                  deduplicator: Optional[Deduplicator] = None):
         """
-        :parameter article_path: path to the article file.
-        :parameter article_base_url: URL to download article.
+        :parameter out_path_maker: image local path creating strategy.
         :parameter skip_list: URLs of images to skip.
         :parameter skip_all_errors: if it's True, skip all errors and continue working.
-        :parameter img_dir_name: relative path of the directory where image files will be downloaded.
-        :parameter img_public_path: if set, will be used in the document instead of `img_dir_name`.
         :parameter downloading_timeout: if timeout =< 0 - infinite wait for the image downloading, otherwise wait for
                                         `downloading_timeout` seconds.
+        :parameter download_incorrect_mime_types: download images even if MIME type can't be identified.
         :parameter deduplicator: file deduplicator object.
-        :parameter process_local_images: if True, local image files will be processed.
         """
 
-        # TODO: rename parameters.
         self._out_path_maker = out_path_maker
         self._skip_list = set(skip_list) if skip_list is not None else []
         self._skip_all_errors = skip_all_errors
         self._downloading_timeout = downloading_timeout if downloading_timeout > 0 else None
+        self._download_incorrect_mime_types = download_incorrect_mime_types
         self._deduplicator = deduplicator
 
     def download_images(self, images: List[str]) -> dict:
@@ -62,15 +61,23 @@ def download_images(self, images: List[str]) -> dict:
             if not image_path_is_url:
                 logging.warning('Image %d ["%s"] probably has incorrect URL...', image_num + 1, image_url)
 
-                if self._out_path_maker._article_base_url:
-                    logging.debug('Trying to add base URL "%s"...', self._out_path_maker._article_base_url)
-                    image_download_url = f'{self._out_path_maker._article_base_url}/{image_url}'
+                if self._out_path_maker.article_base_url:
+                    logging.debug('Trying to add base URL "%s"...', self._out_path_maker.article_base_url)
+                    image_download_url = f'{self._out_path_maker.article_base_url}/{image_url}'
                 else:
-                    image_download_url = str(Path(self._out_path_maker._article_file_path).parent/image_url)
+                    image_download_url = str(Path(self._out_path_maker.article_file_path).parent/image_url)
             else:
                 image_download_url = image_url
 
             try:
+                mime_type, _ = mimetypes.guess_type(image_download_url)
+                logging.debug('"%s" MIME type = %s', image_download_url, mime_type)
+
+                if not self._download_incorrect_mime_types and mime_type is None:
+                    logging.warning('Image "%s" has incorrect MIME type and will not be downloaded!',
+                                    image_download_url)
+                    continue
+
                 image_filename, image_content = \
                     self._get_remote_image(image_download_url, image_num, images_count) if image_path_is_url \
                     else ImageDownloader._get_local_image(Path(image_download_url))
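
Note that `mimetypes.guess_type` looks only at the file extension in the path or URL; it does not fetch the resource or read any response headers. A minimal sketch of the resulting decision, assuming a hypothetical helper `should_download` that mirrors the condition added above (the example URLs are made up):

```python
import mimetypes


def should_download(image_url: str, download_incorrect_mime: bool = False) -> bool:
    """Hypothetical helper: True unless the MIME type cannot be guessed and the override is off."""
    mime_type, _ = mimetypes.guess_type(image_url)
    return download_incorrect_mime or mime_type is not None


# A known extension is accepted; a URL without an extension is skipped
# unless the override (the `-E` option) is set.
png_ok = should_download('https://example.com/pictures/logo.png')
no_ext = should_download('https://example.com/pictures/logo')
forced = should_download('https://example.com/pictures/logo', download_incorrect_mime=True)

print(png_ok, no_ext, forced)
```

One consequence is that images served from extension-less URLs are skipped by default even when they are valid, so such articles need `-E`.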

diff --git a/markdown_toolset/out_path_maker.py b/markdown_toolset/out_path_maker.py
@@ -15,6 +15,13 @@ def __init__(self, article_file_path: Path,
                  img_dir_name: Path = Path('images'),
                  img_public_path: Optional[Path] = None,
                  save_hierarchy: bool = False):
+        """
+        :parameter article_file_path: path to the article file.
+        :parameter article_base_url: URL to download article.
+        :parameter img_dir_name: relative path of the directory where image files will be downloaded.
+        :parameter img_public_path: if set, will be used in the document instead of `img_dir_name`.
+        :parameter save_hierarchy: if set, the remote directory hierarchy will be used when saving images locally.
+        """
 
         logging.debug('Article file path = "%s", base URL = "%s"', article_file_path, article_base_url)