Extendable metadata extractors (#2830) (#2861)
* Rename UNSLUGIFY_TITLES → FILE_METADATA_UNSLUGIFY_TITLES

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Initial scaffolding for metadata extractor plugins

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Use correct environment marker syntax

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Style fixes

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Use metadata extractor APIs and plugins

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Fix galleries; add basic split support (temporary)

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Fix compatibility with .meta files

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* #2856 was effectively undone

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Make metadata splitting work

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Code cleanups

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Fix tests, add new config_present condition

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Allow whitespace before Nikola-style metadata

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Fix RSS tests

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Add MetadataExtractors documentation and `override` MetaPriority

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Style fixes

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Add tests for metadata extractors

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Turns out we run flake8 on tests, too

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Make write_metadata work with metadata extractors

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Fix #2830 — add MetadataExtractor plugins

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Minor documentation fixes [ci skip]

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Address review by @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Remove useless comment [ci skip]

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Slightly smarter return values for split_metadata_for_text

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Pass site when writing metadata in importers

Note this is a minor API change (removes @staticmethod) that needs to be
reflected in some importers.

cc @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>

* Document potentially confusing split_metadata_from_text behavior

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Kwpolska authored and ralsina committed Jul 9, 2017
1 parent 481ce20 commit dc47ce9189ac1295596da1d88722de78a8b28534
Showing with 832 additions and 308 deletions.
  1. +4 −3 CHANGES.txt
  2. +25 −2 docs/extending.txt
  3. +4 −2 docs/manual.txt
  4. +4 −5 nikola/conf.py.in
  5. +258 −0 nikola/metadata_extractors.py
  6. +21 −1 nikola/nikola.py
  7. +84 −9 nikola/plugin_categories.py
  8. +2 −3 nikola/plugins/basic_import.py
  9. +1 −1 nikola/plugins/command/new_post.py
  10. +3 −4 nikola/plugins/compile/html.py
  11. +3 −2 nikola/plugins/compile/ipynb.py
  12. +6 −11 nikola/plugins/compile/markdown/__init__.py
  13. +1 −3 nikola/plugins/compile/pandoc.py
  14. +1 −3 nikola/plugins/compile/php.py
  15. +6 −9 nikola/plugins/compile/rest/__init__.py
  16. +2 −1 nikola/plugins/misc/scan_posts.py
  17. +3 −1 nikola/plugins/task/galleries.py
  18. +66 −82 nikola/post.py
  19. +40 −140 nikola/utils.py
  20. +1 −1 setup.py
  21. +7 −0 tests/data/metadata_extractors/f-html-1-compiler.html
  22. +31 −0 tests/data/metadata_extractors/f-ipynb-1-compiler.ipynb
  23. +7 −0 tests/data/metadata_extractors/f-markdown-1-compiler.md
  24. +13 −0 tests/data/metadata_extractors/f-markdown-1-nikola.md
  25. +2 −0 tests/data/metadata_extractors/f-markdown-2-nikola.md
  26. +7 −0 tests/data/metadata_extractors/f-markdown-2-nikola.meta
  27. +9 −0 tests/data/metadata_extractors/f-rest-1-compiler.rst
  28. +11 −0 tests/data/metadata_extractors/f-rest-1-nikola.rst
  29. +8 −0 tests/data/metadata_extractors/f-rest-1-toml.rst
  30. +8 −0 tests/data/metadata_extractors/f-rest-1-yaml.rst
  31. +7 −0 tests/data/metadata_extractors/f-rest-2-nikola.meta
  32. +2 −0 tests/data/metadata_extractors/f-rest-2-nikola.rst
  33. +6 −0 tests/data/metadata_extractors/f-rest-2-toml.meta
  34. +2 −0 tests/data/metadata_extractors/f-rest-2-toml.rst
  35. +6 −0 tests/data/metadata_extractors/f-rest-2-yaml.meta
  36. +2 −0 tests/data/metadata_extractors/f-rest-2-yaml.rst
  37. +150 −0 tests/test_metadata_extractors.py
  38. +2 −2 tests/test_rss_feeds.py
  39. +17 −23 tests/test_utils.py
@@ -4,13 +4,14 @@ New in master
Features
--------

* Add support for ``MetadataExtractor`` plugins that allow custom,
extensible metadata extraction from posts (Issue #2830)
* Support YAML and TOML metadata in 2-file posts (via Issue #2830)
* Renamed ``UNSLUGIFY_TITLES`` → ``FILE_METADATA_UNSLUGIFY_TITLES`` (Issue #2840)
* Add ``NIKOLA_SHOW_TRACEBACKS`` environment variable that shows
full tracebacks instead of one-line summaries
* Use ``PRETTY_URLS`` by default on all sites (Issue #1838)
* Feed link generation is completely refactored (Issue #2844)
* Added ``extract_metadata`` and ``split_metadata`` to the
``utils`` module, which are used by the metadata extraction
facilities in the ``post`` module.

Bugfixes
--------
@@ -383,8 +383,31 @@ If the compiler produces something other than HTML files, it should also impleme
returns the preferred extension for the output file.

These plugins can also be used to extract metadata from a file. To do so, the
plugin may implement ``read_metadata`` that will return a dict containing the
metadata contained in the file.
plugin must set ``supports_metadata`` to ``True`` and implement ``read_metadata`` that will return a dict containing the
metadata contained in the file. Optionally, it may list ``metadata_conditions`` (see `MetadataExtractor Plugins`_ below)

MetadataExtractor Plugins
-------------------------

Plugins that extract metadata from posts. If they are based on post content,
they must implement ``_extract_metadata_from_text`` (takes the source of a post
and returns a dict of metadata). They may also implement
``split_metadata_from_text`` and ``extract_text``. If they are based on filenames,
they only need ``extract_filename``. If ``supports_write`` is set to ``True``,
``write_metadata`` must be implemented.

Every extractor must be configured properly. The ``name``, ``source`` (from the
``MetaSource`` enum in ``metadata_extractors``) and ``priority``
(``MetaPriority``) fields are mandatory. There might also be a list of
``conditions`` (tuples of ``MetaCondition, arg``), used to check if an
extractor can provide metadata, a compiled regular expression used to split
metadata (``split_metadata_re``, may be ``None``, used by the default
``split_metadata_from_text``), a list of ``requirements`` (3-tuples: import
name, pip name, friendly name), ``map_from`` (name of ``METADATA_MAPPING`` to
use, if any) and ``supports_write`` (whether the extractor supports writing
metadata in the desired format).

For more details, see the definition in ``plugin_categories.py`` and default extractors in ``metadata_extractors.py``.
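
As a rough illustration of the fields and methods described above, here is a minimal sketch of a text-based extractor. The ``PercentMetadata`` class and its ``%%%`` marker format are hypothetical, and the enums and base class are local, heavily reduced stand-ins so the snippet runs without Nikola installed; a real plugin would presumably import them from ``nikola.metadata_extractors`` and ``nikola.plugin_categories`` instead.

```python
import re
from enum import Enum


# Local stand-ins (assumption): reduced copies so the sketch is self-contained.
class MetaSource(Enum):
    text = 1
    filename = 2


class MetaPriority(Enum):
    override = 1
    specialized = 2
    normal = 3
    fallback = 4


class MetaCondition(Enum):
    first_line = 5


class MetadataExtractor:
    """Hypothetical, reduced stand-in for the real plugin base class."""

    conditions = []
    split_metadata_re = None
    supports_write = False


class PercentMetadata(MetadataExtractor):
    """Hypothetical extractor for '%% key: value' lines after a '%%%' marker."""

    name = 'percent'                     # mandatory
    source = MetaSource.text             # mandatory
    priority = MetaPriority.specialized  # mandatory
    conditions = [(MetaCondition.first_line, '%%%')]
    split_metadata_re = re.compile('\n%%%\n')
    line_re = re.compile(r'^%% (.*?): (.*)$')

    def _extract_metadata_from_text(self, source_text: str) -> dict:
        """Take the source of a post and return a dict of metadata."""
        meta = {}
        for line in source_text.split('\n'):
            match = self.line_re.match(line)
            if match:
                meta[match.group(1)] = match.group(2)
        return meta


post = '%%%\n%% title: Hello\n%% slug: hello\n%%%\nBody text.\n'
meta = PercentMetadata()._extract_metadata_from_text(post)
print(meta)
```

Because ``supports_write`` stays ``False`` in this sketch, no ``write_metadata`` method is needed.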

RestExtension Plugins
---------------------
@@ -408,7 +408,7 @@ Current Nikola versions experimentally supports other metadata formats that make
other static site generators. The currently supported metadata formats are:

* reST-style comments (``.. name: value`` — default format)
* Two-file format (reST-style comments or 7-line)
* Two-file format (reST-style, YAML, TOML)
* Jupyter Notebook metadata
* YAML, between ``---`` (Jekyll, Hugo)
* TOML, between ``+++`` (Hugo)
@@ -421,7 +421,7 @@ You can add arbitrary meta fields in any format.
When you create new posts, by default the metadata will be created as reST style comments.
If you prefer a different format, you can set the ``METADATA_FORMAT`` to one of these values:

* ``"Nikola"``: reST comments wrapped in a comment if needed (default)
* ``"Nikola"``: reST comments, wrapped in a HTML comment if needed (default)
* ``"YAML"``: YAML wrapped in "---"
* ``"TOML"``: TOML wrapped in "+++"
* ``"Pelican"``: Native markdown metadata or reST docinfo fields. Nikola style for other formats.
@@ -448,6 +448,8 @@ Meta information can also be specified in separate ``.meta`` files. Those suppor
.. slug: how-to-make-money
.. date: 2012-09-15 19:52:05 UTC

You can also use YAML or TOML metadata inside those (with the appropriate markers).

Jupyter Notebook metadata
`````````````````````````

@@ -206,7 +206,7 @@ COMPILERS = ${COMPILERS}
# ONE_FILE_POSTS = True

# Preferred metadata format for new posts
# "Nikola": reST comments wrapped in a comment if needed (default)
# "Nikola": reST comments, wrapped in a HTML comment if needed (default)
# "YAML": YAML wrapped in "---"
# "TOML": TOML wrapped in "+++"
# "Pelican": Native markdown metadata or reST docinfo fields. Nikola style for other formats.
@@ -1149,6 +1149,9 @@ MARKDOWN_EXTENSIONS = ['markdown.extensions.fenced_code', 'markdown.extensions.c
# (Note the '.*\/' in the beginning -- matches source paths relative to conf.py)
# FILE_METADATA_REGEXP = None

# Should titles fetched from file metadata be unslugified (made prettier?)
# FILE_METADATA_UNSLUGIFY_TITLES = True

# If enabled, extract metadata from docinfo fields in reST documents
# USE_REST_DOCINFO_METADATA = False

@@ -1166,10 +1169,6 @@ MARKDOWN_EXTENSIONS = ['markdown.extensions.fenced_code', 'markdown.extensions.c
# }
# Other examples: https://getnikola.com/handbook.html#mapping-metadata-from-other-formats

# If you hate "Filenames with Capital Letters and Spaces.md", you should
# set this to true.
UNSLUGIFY_TITLES = True

# Additional metadata that is added to a post when creating a new_post
# ADDITIONAL_METADATA = {}

@@ -0,0 +1,258 @@
# -*- coding: utf-8 -*-

# Copyright © 2012-2017 Chris Warrick, Roberto Alsina and others.

# Permission is hereby granted, free of charge, to any
# person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the
# Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the
# Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice
# shall be included in all copies or substantial portions of
# the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
# KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
# OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

"""Default metadata extractors and helper functions."""

import re
import natsort

from enum import Enum
from nikola.plugin_categories import MetadataExtractor
from nikola.utils import unslugify

__all__ = ('MetaCondition', 'MetaPriority', 'MetaSource', 'check_conditions')
_default_extractors = []
DEFAULT_EXTRACTOR_NAME = 'nikola'
DEFAULT_EXTRACTOR = None


class MetaCondition(Enum):
    """Conditions for extracting metadata."""

    config_bool = 1
    config_present = 2
    extension = 3
    compiler = 4
    first_line = 5
    never = -1


class MetaPriority(Enum):
    """Priority of metadata.

    An extractor is used if and only if the higher-priority extractors returned nothing.
    """

    override = 1
    specialized = 2
    normal = 3
    fallback = 4


class MetaSource(Enum):
    """Source of metadata."""

    text = 1
    filename = 2


def check_conditions(post, filename: str, conditions: list, config: dict, source_text: str) -> bool:
    """Check the conditions for a metadata extractor."""
    for ct, arg in conditions:
        if any((
            ct == MetaCondition.config_bool and not config.get(arg, False),
            ct == MetaCondition.config_present and arg not in config,
            ct == MetaCondition.extension and not filename.endswith(arg),
            ct == MetaCondition.compiler and post.compiler.name != arg,
            ct == MetaCondition.never
        )):
            return False
        elif ct == MetaCondition.first_line:
            if not source_text or not source_text.startswith(arg + '\n'):
                return False
    return True
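
The code that consumes these conditions and priorities lives in ``nikola/post.py`` and is not shown in this listing, so the following is only a sketch of how the two could combine: walk the priorities from ``override`` to ``fallback``, skip extractors whose conditions fail, and stop at the first one that returns data. ``StubExtractor`` and ``pick_metadata`` are hypothetical names, and only the ``first_line`` condition is modelled.

```python
from enum import Enum


class MetaPriority(Enum):
    override = 1
    specialized = 2
    normal = 3
    fallback = 4


class StubExtractor:
    """Hypothetical stand-in: returns a fixed dict when its first-line marker matches."""

    def __init__(self, name, priority, first_line, result):
        self.name = name
        self.priority = priority
        self.first_line = first_line  # None means "no condition"
        self.result = result

    def applies_to(self, source_text):
        # Mirrors the first_line check: the marker must be followed by a newline.
        if self.first_line is None:
            return True
        return bool(source_text) and source_text.startswith(self.first_line + '\n')

    def extract(self, source_text):
        return dict(self.result)


def pick_metadata(extractors, source_text):
    """Return (name, metadata) from the highest-priority extractor that yields data."""
    # Iterating an Enum class yields members in definition order: override first.
    for priority in MetaPriority:
        for extractor in extractors:
            if extractor.priority != priority or not extractor.applies_to(source_text):
                continue
            meta = extractor.extract(source_text)
            if meta:
                return extractor.name, meta
    return None, {}


extractors = [
    StubExtractor('nikola', MetaPriority.normal, None, {'title': 'from nikola'}),
    StubExtractor('yaml', MetaPriority.specialized, '---', {'title': 'from yaml'}),
]

yaml_pick = pick_metadata(extractors, '---\ntitle: x\n---\nBody')
nikola_pick = pick_metadata(extractors, '.. title: x\n\nBody')
print(yaml_pick)
print(nikola_pick)
```

A post starting with ``---`` is claimed by the ``specialized`` stub even though it is registered second, while a post without the marker falls through to the ``normal`` one.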


def classify_extractor(extractor: MetadataExtractor, metadata_extractors_by: dict):
    """Classify an extractor and add it to the metadata_extractors_by dict."""
    global DEFAULT_EXTRACTOR
    if extractor.name == DEFAULT_EXTRACTOR_NAME:
        DEFAULT_EXTRACTOR = extractor
    metadata_extractors_by['priority'][extractor.priority].append(extractor)
    metadata_extractors_by['source'][extractor.source].append(extractor)
    metadata_extractors_by['name'][extractor.name] = extractor
    metadata_extractors_by['all'].append(extractor)


def load_defaults(site: 'nikola.nikola.Nikola', metadata_extractors_by: dict):
    """Load default metadata extractors."""
    for extractor in _default_extractors:
        extractor.site = site
        classify_extractor(extractor, metadata_extractors_by)


def is_extractor(extractor) -> bool:
    """Check if a given class is an extractor."""
    return isinstance(extractor, MetadataExtractor)


def default_metadata_extractors_by() -> dict:
    """Return the default metadata_extractors_by dictionary."""
    d = {
        'priority': {},
        'source': {},
        'name': {},
        'all': []
    }

    for i in MetaPriority:
        d['priority'][i] = []
    for i in MetaSource:
        d['source'][i] = []

    return d


def _register_default(extractor: MetadataExtractor) -> MetadataExtractor:
    """Register a default extractor."""
    _default_extractors.append(extractor())
    return extractor


@_register_default
class NikolaMetadata(MetadataExtractor):
    """Extractor for Nikola-style metadata."""

    name = 'nikola'
    source = MetaSource.text
    priority = MetaPriority.normal
    supports_write = True
    split_metadata_re = re.compile('\n\n')
    nikola_re = re.compile(r'^\s*\.\. (.*?): (.*)')

    def _extract_metadata_from_text(self, source_text: str) -> dict:
        """Extract metadata from text."""
        outdict = {}
        for line in source_text.split('\n'):
            match = self.nikola_re.match(line)
            if match:
                outdict[match.group(1)] = match.group(2)
        return outdict

    def write_metadata(self, metadata: dict, comment_wrap=False) -> str:
        """Write metadata in this extractor’s format."""
        metadata = metadata.copy()
        order = ('title', 'slug', 'date', 'tags', 'category', 'link', 'description', 'type')
        f = '.. {0}: {1}'
        meta = []
        for k in order:
            try:
                meta.append(f.format(k, metadata.pop(k)))
            except KeyError:
                pass
        # Leftover metadata (user-specified/non-default).
        for k in natsort.natsorted(list(metadata.keys()), alg=natsort.ns.F | natsort.ns.IC):
            meta.append(f.format(k, metadata[k]))
        data = '\n'.join(meta)
        if comment_wrap is True:
            comment_wrap = ('<!--', '-->')
        if comment_wrap:
            return '\n'.join((comment_wrap[0], data, comment_wrap[1], '', ''))
        else:
            return data + '\n\n'
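
To make the behaviour of ``nikola_re`` concrete, the snippet below applies the same pattern to a few sample lines (the sample lines are made up). It is meant to show two details: leading whitespace before ``..`` is accepted, and because the key group is non-greedy, a value may itself contain ``": "``.

```python
import re

# Same pattern as NikolaMetadata.nikola_re in the listing above.
nikola_re = re.compile(r'^\s*\.\. (.*?): (.*)')

lines = [
    '.. title: Nikola: a static site generator',  # value contains ': '
    '   .. slug: nikola',                         # leading whitespace is accepted
    'Not metadata at all',                        # ignored
]

meta = {}
for line in lines:
    match = nikola_re.match(line)
    if match:
        meta[match.group(1)] = match.group(2)

print(meta)
```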


@_register_default
class YAMLMetadata(MetadataExtractor):
    """Extractor for YAML metadata."""

    name = 'yaml'
    source = MetaSource.text
    conditions = ((MetaCondition.first_line, '---'),)
    requirements = [('yaml', 'PyYAML', 'YAML')]
    supports_write = True
    split_metadata_re = re.compile('\n---\n')
    map_from = 'yaml'
    priority = MetaPriority.specialized

    def _extract_metadata_from_text(self, source_text: str) -> dict:
        """Extract metadata from text."""
        import yaml
        meta = yaml.safe_load(source_text[4:])
        # We expect empty metadata to be '', not None
        for k in meta:
            if meta[k] is None:
                meta[k] = ''
        return meta

    def write_metadata(self, metadata: dict, comment_wrap=False) -> str:
        """Write metadata in this extractor’s format."""
        import yaml
        return '\n'.join(('---', yaml.safe_dump(metadata, default_flow_style=False).strip(), '---', ''))
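
The ``[4:]` `slice in ``_extract_metadata_from_text`` is easier to follow next to the split regex. The actual splitting is done by ``split_metadata_from_text`` in ``plugin_categories.py``, which is not part of this listing, so this sketch simply assumes a single split at the first match of ``split_metadata_re``; the sample post is made up.

```python
import re

# Same pattern as YAMLMetadata.split_metadata_re in the listing above.
split_metadata_re = re.compile('\n---\n')

source = '---\ntitle: Hello\ntags: a, b\n---\nPost body.\n'

# Split once, at the closing marker (assumed behaviour, see lead-in).
meta_block, content = split_metadata_re.split(source, maxsplit=1)

# meta_block still starts with the opening '---\n' (4 characters),
# which is what the [4:] slice skips before handing the text to the parser.
yaml_text = meta_block[4:]

print(repr(yaml_text))
print(repr(content))
```

The opening marker has no newline before it, so the regex only matches the closing ``---`` line. The same reasoning would apply to the ``+++`` markers of the TOML extractor.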


@_register_default
class TOMLMetadata(MetadataExtractor):
    """Extractor for TOML metadata."""

    name = 'toml'
    source = MetaSource.text
    conditions = ((MetaCondition.first_line, '+++'),)
    requirements = [('toml', 'toml', 'TOML')]
    supports_write = True
    split_metadata_re = re.compile('\n\\+\\+\\+\n')
    map_from = 'toml'
    priority = MetaPriority.specialized

    def _extract_metadata_from_text(self, source_text: str) -> dict:
        """Extract metadata from text."""
        import toml
        return toml.loads(source_text[4:])

    def write_metadata(self, metadata: dict, comment_wrap=False) -> str:
        """Write metadata in this extractor’s format."""
        import toml
        return '\n'.join(('+++', toml.dumps(metadata).strip(), '+++', ''))


@_register_default
class FilenameRegexMetadata(MetadataExtractor):
    """Extractor for filename metadata."""

    name = 'filename_regex'
    source = MetaSource.filename
    priority = MetaPriority.fallback
    conditions = [(MetaCondition.config_bool, 'FILE_METADATA_REGEXP')]

    def extract_filename(self, filename: str, lang: str) -> dict:
        """Try to read the metadata from the filename based on the given re.

        This requires to use symbolic group names in the pattern.
        The part to read the metadata from the filename based on a regular
        expression is taken from Pelican - pelican/readers.py
        """
        match = re.match(self.site.config['FILE_METADATA_REGEXP'], filename)
        meta = {}

        if match:
            for key, value in match.groupdict().items():
                k = key.lower().strip()  # metadata must be lowercase
                if k == 'title' and self.site.config['FILE_METADATA_UNSLUGIFY_TITLES']:
                    meta[k] = unslugify(value, lang, discard_numbers=False)
                else:
                    meta[k] = value

        return meta
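
For readers unfamiliar with symbolic group names, here is a small standalone sketch of what ``extract_filename`` does with them. The pattern and the path are hypothetical and chosen only for illustration; the ``title`` branch is left out because it depends on ``nikola.utils.unslugify``.

```python
import re

# Hypothetical pattern and path (assumptions, not taken from Nikola's defaults).
# 'Slug' is capitalised on purpose to show that keys are lowercased.
FILE_METADATA_REGEXP = r'.*/(?P<date>\d{4}-\d{2}-\d{2})_(?P<Slug>[^_/]+)\.md$'
filename = 'posts/2017-07-09_metadata-extractors.md'

meta = {}
match = re.match(FILE_METADATA_REGEXP, filename)
if match:
    # Each named group becomes one metadata key.
    for key, value in match.groupdict().items():
        meta[key.lower().strip()] = value

print(meta)
```

Since the extractor is registered with a ``config_bool`` condition on ``FILE_METADATA_REGEXP``, it is only consulted when that setting holds a truthy value, and its ``fallback`` priority means it only applies when no higher-priority extractor returned metadata.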
