MIME checking was added
Artiom N committed Oct 5, 2022
1 parent 0d5b56d commit 3ce48dd
Showing 7 changed files with 97 additions and 56 deletions.
56 changes: 31 additions & 25 deletions README.md
@@ -12,8 +12,8 @@ The Markdown Articles Tool is available for macOS, Windows, and Linux.

Tool can be used:

- To download markdown text with images with images and:
* Find all links to images, download images and fix links in the document.
- To download Markdown documents with images and:
* Find all image links, download images and fix links in the document.
* Can skip broken links.
* Deduplicate similar images by content hash or using hash as a name.
- Support images, linked with HTML `<img>` tag.
@@ -28,13 +28,19 @@ Also, if you want to use separate functions, you can just import the package.

## Changes

### 0.0.8

`-D` (deduplication) option was changed in the version 0.0.8. Now option is not boolean, it has several values: "disabled", "names_hashing", "content_hash".
Long option name was changed too: now it's `deduplication-type`.

### 0.1.2

## Possibly bugs

Deduplication can replace not similar images. Probability of this is very low, but it's possible. Will be fixed in the next version.
- `-l, --process-local-images` deprecated from the version 0.1.2 and will not work: local images will always be processed.
- Images with unrecognized MIME type will not be downloaded by default (use `-E` to disable this behaviour).
- New option `-P, --prepend-images-with-path` changes image output path structure. If this option is enabled,
"remote" image path will be saved in the local directory structure.
- Code was significantly refactored.
- Some auto tests were added.


## Installation
@@ -61,38 +67,38 @@ pip3 install markdown-tool
Syntax:

```
markdown_tool [-h] [-D {disabled,names_hashing,content_hash}] [-d IMAGES_DIRNAME] [-a] [-s SKIP_LIST]
[-i {md,html,md+html,html+md}] [-l] [-n] [-o {md,html}] [-p IMAGES_PUBLIC_PATH] [-P] [-R]
[-t DOWNLOADING_TIMEOUT] [-O OUTPUT_PATH] [--verbose] [--version] article_file_path_or_url
markdown_tool [options] <article_file_path_or_url>
options:
-h, --help show this help message and exit
-D {disabled,names_hashing,content_hash}, --deduplication-type {disabled,names_hashing,content_hash}
Deduplicate images, using content hash or SHA1(image_name)
Deduplicate images, using content hash or SHA1(image_name) (default: disabled)
-d IMAGES_DIRNAME, --images-dirname IMAGES_DIRNAME
Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url)
Folder in which to download images (possible variables: $article_name, $time, $date, $dt, $base_url) (default: images)
-a, --skip-all-incorrect
skip all incorrect images
skip all incorrect images (default: False)
-E, --download-incorrect-mime
download "images" with unrecognized MIME type (default: False)
-s SKIP_LIST, --skip-list SKIP_LIST
skip URL's from the comma-separated list (or file with a leading '@')
skip URL's from the comma-separated list (or file with a leading '@') (default: None)
-i {md,html,md+html,html+md}, --input-format {md,html,md+html,html+md}
input format
input format (default: md)
-l, --process-local-images
[DEPRECATED] Process local images
[DEPRECATED] Process local images (default: False)
-n, --replace-image-names
Replace image names, using content hash
Replace image names, using content hash (default: False)
-o {md,html}, --output-format {md,html}
output format
output format (default: md)
-p IMAGES_PUBLIC_PATH, --images-public-path IMAGES_PUBLIC_PATH
Public path to the folder of downloaded images (possible variables: $article_name, $time, $date, $dt, $base_url)
-P, --prepend-images-with-path
Save relative images paths
-R, --remove-source Remove or replace source file
Save relative images paths (default: False)
-R, --remove-source Remove or replace source file (default: False)
-t DOWNLOADING_TIMEOUT, --downloading-timeout DOWNLOADING_TIMEOUT
how many seconds to wait before downloading will be failed
how many seconds to wait before downloading will be failed (default: -1)
-O OUTPUT_PATH, --output-path OUTPUT_PATH
article output file name
--verbose, -v More verbose logging
article output file name or path
--verbose, -v More verbose logging (default: False)
--version return version number
```

@@ -119,10 +125,10 @@ find content/ -name "*.md" | xargs -n1 ./markdown_tool.py

Tools is a pipeline, which get Markdown form the source and process them, using blocks:

- Source to download article.
- `ImageDownloader` to download every image.
Inside may be used image deduplicators blocks applied to the image.
- Source download article.
- `ImageDownloader` download every image.
Inside may be used image deduplicator blocks applied to the image.
- Transform article file, i.e. fix images URLs.
- Format article to the specific format (Markdown, HTML, PDF, etc.), using selected formatters.

`ArticleProcessor` clas is a strategy, applies blocks, based on the parameters (from the CLI, for example).
`ArticleProcessor` class is a strategy, applies blocks, based on the parameters (from the CLI, for example).
5 changes: 4 additions & 1 deletion markdown_tool.py
@@ -60,6 +60,7 @@ def main(arguments):
images_public_path=getattr(arguments, 'images_public_path', ''),
input_formats=arguments.input_format.split('+'),
skip_all_incorrect=arguments.skip_all_incorrect,
download_incorrect_mime=arguments.download_incorrect_mime,
deduplication_type=getattr(DeduplicationVariant, arguments.deduplication_type.upper()),
images_dirname=arguments.images_dirname)

@@ -72,7 +73,7 @@ def main(arguments):

parser = argparse.ArgumentParser(
prog='markdown_tool',
epilog='Use this at your own risk',
epilog='Use tool at your own risk!',
description=f'{__doc__}Version: {__version__}',
formatter_class=CustomArgumentDefaultsHelpFormatter
)
@@ -85,6 +86,8 @@ def main(arguments):
'(possible variables: $article_name, $time, $date, $dt, $base_url)')
parser.add_argument('-a', '--skip-all-incorrect', default=False, action='store_true',
help='skip all incorrect images')
parser.add_argument('-E', '--download-incorrect-mime', default=False, action='store_true',
help='download "images" with unrecognized MIME type')
parser.add_argument('-s', '--skip-list', default=None,
help='skip URL\'s from the comma-separated list (or file with a leading \'@\')')
parser.add_argument('-i', '--input-format', default='md', choices=IN_FORMATS_LIST,
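Note on the new `-E` switch: it is a plain `store_true` flag, so it is off unless passed explicitly. A minimal, self-contained sketch of how such a flag parses is given below; the `build_parser` helper and the sample file name `article.md` are illustrative assumptions, and only the options relevant to this commit are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors two options from the diff, nothing else.
    parser = argparse.ArgumentParser(prog='markdown_tool')
    parser.add_argument('-a', '--skip-all-incorrect', default=False, action='store_true',
                        help='skip all incorrect images')
    parser.add_argument('-E', '--download-incorrect-mime', default=False, action='store_true',
                        help='download "images" with unrecognized MIME type')
    parser.add_argument('article_file_path_or_url')
    return parser


# Without -E the flag keeps its default; with -E it is switched on.
default_args = build_parser().parse_args(['article.md'])
permissive_args = build_parser().parse_args(['-E', 'article.md'])
```

argparse converts the dashes in `--download-incorrect-mime` to underscores, which is why `main()` can read the value as `arguments.download_incorrect_mime`.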
3 changes: 2 additions & 1 deletion markdown_toolset/article_downloader.py
@@ -2,7 +2,7 @@
from pathlib import Path
from time import strftime

from .www_tools import is_url, download_from_url, get_filename_from_url, get_base_url, remove_protocol_prefix
from .www_tools import is_url, download_from_url, get_filename_from_url, get_base_url


class ArticleDownloader:
@@ -16,6 +16,7 @@ def __init__(self, article_url, output_path, article_formatter, downloading_time
self._article_formatter = article_formatter
self._downloading_timeout = downloading_timeout
self._remove_source = remove_source
# TODO: Merge `article_path` and `article_out_path`.
self._article_path = None
self._article_out_path = None

29 changes: 10 additions & 19 deletions markdown_toolset/article_processor.py
@@ -1,14 +1,12 @@
import logging
from enum import Enum
from itertools import permutations
from pathlib import Path
from string import Template
from time import strftime
from typing import Union, List

from .article_downloader import ArticleDownloader
from .deduplicators.content_hash_dedup import ContentHashDeduplicator
from .deduplicators.name_hash_dedup import NameHashDeduplicator
from .deduplicators import DeduplicationVariant, select_deduplicator
from .out_path_maker import OutPathMaker
from .www_tools import remove_protocol_prefix
from .image_downloader import ImageDownloader
@@ -20,18 +18,13 @@
OUT_FORMATS_LIST = [f.format for f in FORMATTERS if f is not None]


class DeduplicationVariant(Enum):
DISABLED = 0,
NAMES_HASHING = 1,
CONTENT_HASH = 2


class ArticleProcessor:
def __init__(self, article_file_path_or_url: str,
skip_list: Union[str, List[str]] = '', downloading_timeout: int = -1,
output_format: str = OUT_FORMATS_LIST[0], output_path: Union[Path, str] = Path.cwd(),
remove_source: bool = False, images_public_path: Union[Path, str] = '',
input_formats: List[str] = tuple(IN_FORMATS_LIST), skip_all_incorrect: bool = False,
download_incorrect_mime: bool = False,
deduplication_type: DeduplicationVariant = DeduplicationVariant.DISABLED,
images_dirname: Union[Path, str] = 'images'):
self._article_formatter = get_formatter(output_format, FORMATTERS)
@@ -44,11 +37,12 @@ def __init__(self, article_file_path_or_url: str,
self._images_public_path = images_public_path
self._input_formats = input_formats
self._skip_all_incorrect = skip_all_incorrect
self._download_incorrect_mime = download_incorrect_mime
self._deduplication_type = deduplication_type
self._images_dirname = images_dirname

def process(self):
skip_list = self._process_skip_list()
skip_list = self._process_skip_list_file()
article_path, article_base_url, article_out_path = self._article_downloader.get_article()

logging.info('File "%s" will be processed...', article_path)
@@ -67,14 +61,10 @@ def process(self):
image_dir_name = Path(Template(self._images_dirname).safe_substitute(**variables))
image_public_path = None if not image_public_path else Path(image_public_path)

deduplicator = None

if DeduplicationVariant.CONTENT_HASH == self._deduplication_type:
deduplicator = ContentHashDeduplicator(image_dir_name, image_public_path)
elif DeduplicationVariant.NAMES_HASHING == self._deduplication_type:
deduplicator = NameHashDeduplicator()
elif DeduplicationVariant.DISABLED == self._deduplication_type:
pass
if self._deduplication_type == DeduplicationVariant.CONTENT_HASH:
deduplicator = select_deduplicator(self._deduplication_type, image_dir_name, image_public_path)
else:
deduplicator = select_deduplicator(self._deduplication_type)

out_path_maker = OutPathMaker(
article_file_path=article_path,
@@ -87,14 +77,15 @@ def process(self):
out_path_maker=out_path_maker,
skip_list=skip_list,
skip_all_errors=self._skip_all_incorrect,
download_incorrect_mime_types=self._download_incorrect_mime,
downloading_timeout=self._downloading_timeout,
deduplicator=deduplicator
)

result = transform_article(article_path, self._input_formats, TRANSFORMERS, img_downloader)
format_article(article_out_path, result, self._article_formatter)

def _process_skip_list(self):
def _process_skip_list_file(self):
skip_list = self._skip_list

if isinstance(skip_list, str):
26 changes: 26 additions & 0 deletions markdown_toolset/deduplicators/__init__.py
@@ -0,0 +1,26 @@
from enum import Enum

from markdown_toolset.deduplicators.content_hash_dedup import ContentHashDeduplicator
from markdown_toolset.deduplicators.name_hash_dedup import NameHashDeduplicator


class DeduplicationVariant(Enum):
DISABLED = 0,
NAMES_HASHING = 1,
CONTENT_HASH = 2


DEDUP_MAP = {
DeduplicationVariant.CONTENT_HASH: ContentHashDeduplicator,
DeduplicationVariant.NAMES_HASHING: NameHashDeduplicator,
DeduplicationVariant.DISABLED: None,
}


def select_deduplicator(deduplication_variant: DeduplicationVariant, *args, **kwargs):
dedup_class = DEDUP_MAP[deduplication_variant]

return dedup_class(*args, **kwargs) if dedup_class is not None else None


__all__ = [DeduplicationVariant, select_deduplicator]
27 changes: 17 additions & 10 deletions markdown_toolset/image_downloader.py
@@ -1,5 +1,6 @@
import logging
import hashlib
import mimetypes
from pathlib import Path
from typing import Optional, List

@@ -17,26 +18,24 @@ def __init__(self,
out_path_maker: OutPathMaker,
skip_list: Optional[List[str]] = None,
skip_all_errors: bool = False,
download_incorrect_mime_types: bool = False,
downloading_timeout: float = -1,
deduplicator: Optional[Deduplicator] = None):
"""
:parameter article_path: path to the article file.
:parameter article_base_url: URL to download article.
:parameter out_path_maker: image local path creating strategy.
:parameter skip_list: URLs of images to skip.
:parameter skip_all_errors: if it's True, skip all errors and continue working.
:parameter img_dir_name: relative path of the directory where image files will be downloaded.
:parameter img_public_path: if set, will be used in the document instead of `img_dir_name`.
:parameter downloading_timeout: if timeout =< 0 - infinite wait for the image downloading, otherwise wait for
`downloading_timeout` seconds.
:parameter download_incorrect_mime_types: download images even if MIME type can't be identified.
:parameter deduplicator: file deduplicator object.
:parameter process_local_images: if True, local image files will be processed.
"""

# TODO: rename parameters.
self._out_path_maker = out_path_maker
self._skip_list = set(skip_list) if skip_list is not None else []
self._skip_all_errors = skip_all_errors
self._downloading_timeout = downloading_timeout if downloading_timeout > 0 else None
self._download_incorrect_mime_types = download_incorrect_mime_types
self._deduplicator = deduplicator

def download_images(self, images: List[str]) -> dict:
@@ -62,15 +61,23 @@ def download_images(self, images: List[str]) -> dict:
if not image_path_is_url:
logging.warning('Image %d ["%s"] probably has incorrect URL...', image_num + 1, image_url)

if self._out_path_maker._article_base_url:
logging.debug('Trying to add base URL "%s"...', self._out_path_maker._article_base_url)
image_download_url = f'{self._out_path_maker._article_base_url}/{image_url}'
if self._out_path_maker.article_base_url:
logging.debug('Trying to add base URL "%s"...', self._out_path_maker.article_base_url)
image_download_url = f'{self._out_path_maker.article_base_url}/{image_url}'
else:
image_download_url = str(Path(self._out_path_maker._article_file_path).parent/image_url)
image_download_url = str(Path(self._out_path_maker.article_file_path).parent/image_url)
else:
image_download_url = image_url

try:
mime_type, _ = mimetypes.guess_type(image_download_url)
logging.debug('"%s" MIME type = %s', image_download_url, mime_type)

if not self._download_incorrect_mime_types and mime_type is None:
logging.warning('Image "%s" has incorrect MIME type and will not be downloaded!',
image_download_url)
continue

image_filename, image_content = \
self._get_remote_image(image_download_url, image_num, images_count) if image_path_is_url \
else ImageDownloader._get_local_image(Path(image_download_url))
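Note on the MIME check: `mimetypes.guess_type` decides from the file extension in the path alone and never fetches the resource, so an image URL without an extension is reported as `None` and skipped unless `-E` is given. The sketch below isolates that decision; `should_download` is a hypothetical helper name, not a function from this commit, and the example URLs are made up.

```python
import mimetypes


def should_download(image_url: str, download_incorrect_mime: bool = False) -> bool:
    # guess_type() returns (type, encoding); type is None when the extension is unknown or missing.
    mime_type, _ = mimetypes.guess_type(image_url)
    # Same condition as in the diff, inverted: skip only when the type is unknown and -E is off.
    return download_incorrect_mime or mime_type is not None


has_extension = should_download('https://example.com/pictures/logo.png')
no_extension = should_download('https://example.com/avatars/12345')
forced = should_download('https://example.com/avatars/12345', download_incorrect_mime=True)
```

As written in the diff, the check only tests for `None`; it does not require the guessed type to start with `image/`, so a link ending in `.html` would still pass.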
7 changes: 7 additions & 0 deletions markdown_toolset/out_path_maker.py
@@ -15,6 +15,13 @@ def __init__(self, article_file_path: Path,
img_dir_name: Path = Path('images'),
img_public_path: Optional[Path] = None,
save_hierarchy: bool = False):
"""
:parameter article_file_path: path to the article file.
:parameter article_base_url: URL to download article.
:parameter img_dir_name: relative path of the directory where image files will be downloaded.
:parameter img_public_path: if set, will be used in the document instead of `img_dir_name`.
:parameter save_hierarchy: if set, remote hierarchy will be used for the save image locally.
"""

logging.debug('Article file path = "%s", base URL = "%s"', article_file_path, article_base_url)
