Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extendable metadata extractors (#2830) #2861

Merged
merged 25 commits into from Jul 9, 2017
Merged

Extendable metadata extractors (#2830) #2861

merged 25 commits into from Jul 9, 2017

Conversation

@Kwpolska
Copy link
Member

@Kwpolska Kwpolska commented Jul 5, 2017

This is #2830 — extendable metadata extractors.

Every extractor plugin specifies the source of information (currently supported are text and filename) and its priority (based on how well it knows it’s needed — YAML/TOML use markers, so they’re specialized; Nikola is normal, and FileMetadataRegexp is fallback). This tries to be extendable, although I probably missed a few things regarding openness.

Highlights:

  • YAML and TOML support in 2-file posts
  • FILE_METADATA_REGEXP still supported
  • Nikola-site and a lot of test stuff works properly, without changes
  • Posts know what metadata extractor was used, if they are 1-file posts
  • Compilers must advertise support for metadata, can use conditions, and have the highest priority
  • Metadata is split for parsing (and then again when compiling)
  • I’m doing modern Python 3: using enums and (basic) type annotations. This adds enum34 for Python 3.3 (do we need that version though?)
  • I had to effectively undo #2856 as part of the implementation, since the new extractors have different requirements
  • Some compiler APIs were changed as well, so old compilers might not work
  • If someone needs direct access to the Nikola extractor, it’s metadata_extractors.DEFAULT_EXTRACTOR — but most of the time, you should go through metadata_extractors_by or other convenience methods.
  • As I promised, the built-in extractors are not loaded by yapsy. This sort-of optimization could also be useful for a few other mandatory plugins, such as PostScanner or render_posts/render_pages

Questions:

  • What was match used for in re_meta? I didn’t see any uses, so I nuked it.
  • I removed support for \r\n when splitting metadata. The reason? This happens only when dealing with Windows-style newlines on Linux. (unless I’m wrong. We can restore it just in case.)
  • Right now, metadata_extractors_by is on the site object, but I’m not sure about it — perhaps it would work better in metadata_extractors.py as a global?
  • I kinda want to change create_post compiler API to def create_post(self, metadata, options): — opinions?

Future and caveats:

  • Compiler plugins may need changes for compatibility, I didn’t do that.
  • pkgindex needs a rewrite. I didn’t do that either.
  • I plan to extend annotations over to all plugin categories. PyCharm is oh-so-nice when you add type annotations.
  • We still parse reST/Markdown posts twice if using their respective meta formats. Memory usage would skyrocket if we tried to avoid that.

Comments and reviews welcome.

Kwpolska added 19 commits Jul 3, 2017
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@getnikola getnikola deleted a comment Jul 6, 2017
@getnikola getnikola deleted a comment Jul 6, 2017
@getnikola getnikola deleted a comment Jul 6, 2017
Copy link
Contributor

@felixfontein felixfontein left a comment

This looks really great! There are two things in plugin_categories.MetadataExtractor which you should definitely take a look at, but everything else looks fine!

"""Check the conditions for a metadata extractor."""
for ct, arg in conditions:
if any((
ct == MetaCondition.config_bool and (arg not in config or (arg in config and not config[arg])),
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

Why not ct == MetaCondition.config_bool and not config.get(arg, False), or ct == MetaCondition.config_bool and (arg not in config or not config[arg])?

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Brain fart. Fixed.

"""Extract metadata from filename."""
return {}

def write_metadata(self, metadata: dict, comment_wrap=False) -> str:
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

There should be a warning/note that such a method can empty the metadata dictionary. (The default Nikola extractor does that, for example.)

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Better solution, change the behavior of Nikola’s extractor. A copy of a small dict shouldn’t hurt.

Copy link
Contributor

@felixfontein felixfontein Jul 8, 2017

That's even better :)

priority = MetaPriority.normal
supports_write = True
split_metadata_re = re.compile('\n\n')
nikola_re = re.compile('^\s*\.\. (.*?): (.*)')
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

Wouldn't it be better for this to be a raw string? I know it makes no difference, but for someone not that familiar with Python strings, it is not so easy to guess whether '\s' is an escape sequence or does something strange. (Also, you're writing '\\+' in the TOML metadata regex, should could also be written as '\+'.)

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

In Nikola, you’re right. In TOML, this is intended:

split_metadata_re = re.compile('\n\\+\\+\\+\n')

We need \n for a real newline. So the double backslash becomes a single backslash, and it ends up escaping the + from regexp’s point of view.

Copy link
Contributor

@felixfontein felixfontein Jul 8, 2017

I know; I meant that '\+' == '\\+', which is true for the same reason than '\s' == '\\s' (because \s is no valid escape sequence).

Copy link
Member Author

@Kwpolska Kwpolska Jul 8, 2017

Copy link
Contributor

@felixfontein felixfontein Jul 8, 2017

Makes sense; I was surprised that it works. I was just confused because here you seemed to use the (deprecated) behavior, while further down you were escaping, and I didn't know if that was intentional or not. But now that's resolved :)

nikola/nikola.py Outdated
@@ -728,6 +732,11 @@ def __init__(self, **config):
if self.config['PRESERVE_EXIF_DATA'] and not self.config['EXIF_WHITELIST']:
utils.LOGGER.warn('You are setting PRESERVE_EXIF_DATA and not EXIF_WHITELIST so EXIF data is not really kept.')

# TODO: should we keep backwards compat here?
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

It's nice to have a warning for some removed/renamed features for at least one or two releases. If someone is still using this feature with an up-to-date Nikola, he at least will notice that something changed :)

def split_metadata_from_text(self, source_text: str) -> (str, str):
"""Split text into metadata and content (both strings)."""
if self.split_metadata_re is None:
return source_text
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

Shouldn't this be return '', source_text?

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Not quite, see below.

return split_result[0], split_result[-1]

def extract_text(self, source_text: str) -> dict:
"""Split file, return metadata and the content."""
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

The function is doing something else. It returns the parsed metadata, but not the content.

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Leftover from a previous design. Fixed.

@@ -166,7 +166,7 @@ def write_metadata(filename, title, slug, post_date, description, tags, **kwargs
with io.open(filename, "w+", encoding="utf8") as fd:
data = {'title': title, 'slug': slug, 'date': post_date, 'tags': ','.join(tags), 'description': description}
data.update(kwargs)
fd.write(utils.write_metadata(data))
fd.write(utils.write_metadata(data, metadata_format='nikola', comment_wrap=False))
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

Even though it's technically not needed (because nikola is always provided by Nikola core), why not pass site as well?

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

There is no site in the importer plugins.

Copy link
Contributor

@felixfontein felixfontein Jul 8, 2017

But you still have a site object (with all default options)?

Copy link
Member Author

@Kwpolska Kwpolska Jul 8, 2017

Do we really? I can see references to that in the WordPress importer, but I have no idea where it comes from…

Copy link
Contributor

@felixfontein felixfontein Jul 8, 2017

Every plugin has a set_site call which is called on plugin initialization. There is no documentation saying that this isn't the case for Command plugins, and in particular importer plugins. So I would assume it is a bug if the site object isn't there.

Copy link
Member Author

@Kwpolska Kwpolska Jul 8, 2017

Ah, right. I misremembered needs_config — turns out a site always exists, but isn’t always required to be configured. I’ll fix that.

nikola/post.py Outdated
except AttributeError:
config = None
config = getattr(post, 'config', None)
metadata_extractors_by = getattr(post, 'metadata_extractors_by', metadata_extractors.default_metadata_extractors_by())
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

This always calls metadata_extractors.default_metadata_extractors_by(), even though it should never be used. Wouldn't it be better to add two more lines (if metadata_extractors_by is None: / metadata_extractors_by = metadata_extractors.default_metadata_extractors_by()) to avoid this call every time?

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Fixed.

nikola/utils.py Outdated
"""
# API compatibility
if metadata_format is None:
metadata_format = site.config.get('METADATA_FORMAT', 'nikola').lower()
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

In case site is done, this dies.

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

Fixed.

nikola/utils.py Outdated
extractor.check_requirements()
return extractor.write_metadata(data, comment_wrap)
elif metadata_format is None:
pass # Quiet fallback to Nikola
Copy link
Contributor

@felixfontein felixfontein Jul 6, 2017

This won't be quiet, because the new if below will print out a warning.

Copy link
Member Author

@Kwpolska Kwpolska Jul 7, 2017

I made sure metadata_format will never be None.

@felixfontein
Copy link
Contributor

@felixfontein felixfontein commented Jul 6, 2017

It's nice to see that Python also got type annotations. Also, I'm for enums; whether it will be enum34 or Python 3.4+ only, I don't mind.

@tritium21
Copy link
Contributor

@tritium21 tritium21 commented Jul 7, 2017

What platforms ship 3.3 as the default python3? Are those platforms still supported? I'm pretty sure ubuntu 12.04 was 3.2, but 14.04 was 3.4, but i could be wrong.

@felixfontein
Copy link
Contributor

@felixfontein felixfontein commented Jul 7, 2017

Debian 8 (jessie == oldstable) ships with Python 3.4, and Debian 9 probably with something newer.

Python 3.4 was released in March 2014, so Ubuntu 14.04 supports either 3.3 or 3.4, with 3.3 being more likely. I have no Ubuntu 14.04 boxes left to check, though...

@Kwpolska
Copy link
Member Author

@Kwpolska Kwpolska commented Jul 7, 2017

You don’t need boxen to check: 14.04 trusty ships 3.4.0, 9 stretch ships 3.5.3

Kwpolska added 3 commits Jul 7, 2017
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
Note this is a minor API change (removes @staticmethod) that needs to be
reflected in some importers.

cc @felixfontein

Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
Signed-off-by: Chris Warrick <kwpolska@gmail.com>
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
@getnikola getnikola deleted a comment Jul 8, 2017
@Kwpolska
Copy link
Member Author

@Kwpolska Kwpolska commented Jul 8, 2017

@felixfontein: thanks!

@ralsina: It’d be nice if you reviewed, too.

@ralsina
Copy link
Member

@ralsina ralsina commented Jul 8, 2017

@Kwpolska I'll give it a look this weekend.

ralsina
ralsina approved these changes Jul 9, 2017
Copy link
Member

@ralsina ralsina left a comment

LGTM and is pretty awesome!

@ralsina ralsina merged commit dc47ce9 into master Jul 9, 2017
5 checks passed
5 checks passed
codacy/pr Good work! A positive pull request.
Details
continuous-integration/appveyor/branch AppVeyor build succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@ralsina ralsina deleted the metadata-extractors branch Jul 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants