Path metadata issues with date, category, lang, slug (when not mandatory) #1128

Closed
fvsch opened this issue Oct 21, 2013 · 10 comments

@fvsch

fvsch commented Oct 21, 2013

I’ve been playing with path metadata and regexp, trying to extract “core” metadata from filename and path. This is what I ended up with:

PATH_METADATA = r'\A(?P<category>[^/]+)/((?P<slug>[a-zA-Z0-9_-]*)/)?'
FILENAME_METADATA = r'(?P<date>\d{4}-\d{2}-\d{2})?[ _-]?(?P<title>[^/.]+)(\.(?P<lang>[a-z]{2,3}))?\Z'

The goal is to match date, category, slug and lang information in file paths, such as:

category/article-slug/2013-10-15 This is the article’s title.en.md

… while making only two of these patterns mandatory: category and title (the other ones can then be defined in the file itself). So we could have:

category/Whatever.md

The regex works (I tested it quite a bit), but there are a number of issues with how Pelican uses the data. It seems that once the regex has been matched, all the named groups that matched nothing end up with a value of None, and Pelican doesn’t know what to do with those in this context. Hence a bunch of errors or issues.
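To illustrate the None behavior, here is a minimal sketch using only Python's `re` module. It assumes the pattern is applied with `re.match` to the file's base name with the extension already removed; the helper name `filename_groups` and the sample titles are made up for the example.

```python
import re

# Pattern copied from the report above, split over two lines for readability.
FILENAME_METADATA = (
    r'(?P<date>\d{4}-\d{2}-\d{2})?[ _-]?'
    r'(?P<title>[^/.]+)(\.(?P<lang>[a-z]{2,3}))?\Z'
)


def filename_groups(base):
    """Hypothetical helper: named groups for a base name (extension removed)."""
    match = re.match(FILENAME_METADATA, base)
    return match.groupdict() if match else {}


# All optional groups take part in the match.
full = filename_groups('2013-10-15 Some title.en')
# Only the mandatory title is present; date and lang come back as None.
bare = filename_groups('Whatever')

print(full)
print(bare)
```

Optional groups that do not take part in the match are reported as None rather than being left out of the dictionary.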

Missing dates yield an error

WARNING: Could not process words/test.md
expected string or buffer

After some testing this is because the date named group in the regex returns None. It doesn’t matter if the file test.md has inline date metadata: it still creates an error and the file is skipped.
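One plausible way a None date produces this kind of error: any parser that runs a regex over the value raises a TypeError when handed None instead of a string. The sketch below is not Pelican's date handling; `looks_like_iso_date` is a hypothetical stand-in.

```python
import re


def looks_like_iso_date(value):
    """Hypothetical stand-in for a date parser that expects a string."""
    return re.match(r'\d{4}-\d{2}-\d{2}', value) is not None


try:
    looks_like_iso_date(None)  # what happens when the date group is None
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```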

Missing slug

WARNING: empty slug for u'/Users/…/words/2012-08-05 Another test.md'.
You can fix this by adding a title or a slug to your content

Pelican does not try to generate a slug from the title matched in the regex, or from the title metadata in the file itself. Or, if it does use the title metadata in the file to generate a slug, the result gets overwritten with a None value, or something like that.

lang set as None

I was trying to output the articles as:

ARTICLE_SAVE_AS = '{category}/{slug}/index.{lang}.html'

When the lang metadata wasn't set, and matched as None in path metadata, I ended up with output paths such as:

/some-category/some-slug/index.None.html
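The `index.None.html` path is what plain string formatting gives when a None value is passed through. A small sketch, assuming the save-as template is filled with `str.format`; the `DEFAULT_LANG` value and the filtering step are illustrative, not Pelican's actual code.

```python
ARTICLE_SAVE_AS = '{category}/{slug}/index.{lang}.html'
DEFAULT_LANG = 'en'  # assumed project default for this example

metadata = {'category': 'some-category', 'slug': 'some-slug', 'lang': None}

# None is formatted as the literal text "None".
broken = ARTICLE_SAVE_AS.format(**metadata)

# Dropping None values first lets a default take over instead.
cleaned = {key: value for key, value in metadata.items() if value is not None}
cleaned.setdefault('lang', DEFAULT_LANG)
fixed = ARTICLE_SAVE_AS.format(**cleaned)

print(broken)
print(fixed)
```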

category clashes with USE_FOLDER_AS_CATEGORY

With the above regex and a source file at my-cat/my-slug/index.en.md, the regex matches category='my-cat', slug='my-slug' alright. But the output is then:

/my-slug/my-slug/index.en.html

if USE_FOLDER_AS_CATEGORY is True.
The simple solution is to set USE_FOLDER_AS_CATEGORY to False. But maybe the USE_FOLDER_AS_CATEGORY behavior shouldn't overwrite the category metadata if it's already set/extracted?

@herbstk

herbstk commented Aug 25, 2014

I would like to know if any of this behavior has been tackled since @fvsch brought it up.

For a project I would also like to extract the metadata. More specifically, I would like to extract lang and slug from the filename using this RE string '(?P<slug>\w+)\.?(?P<lang>[a-z]{0,3})'. I have the three languages "en", "de" and "fr" within the project, where "fr" is the default language. For each page I will therefore have three versions, e.g. index.md (for "fr"), index.en.md and index.de.md.

If the metadata lang is not set in index.md, the "fr" version does not appear at all in the output. I guess that internally the RE string returns an empty string for the named group lang for the filename index.md, which overwrites the default project-wide setting. Would you have any suggestions for this? I suppose it would make more sense to fall back on the default values if neither file/path metadata matching nor the front matter returns a set metadata value.
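The guess about the empty string can be checked with `re` alone. The sketch below assumes the pattern is matched against the base name without the `.md` extension; `lang_for` is a hypothetical helper showing the suggested fallback, not existing Pelican behavior.

```python
import re

# Pattern copied from the comment above.
FILENAME_METADATA = r'(?P<slug>\w+)\.?(?P<lang>[a-z]{0,3})'
DEFAULT_LANG = 'fr'  # the project's default language in this scenario

# {0,3} allows a zero-length match, so lang is '' (not None) for "index".
raw_default = re.match(FILENAME_METADATA, 'index').groupdict()
raw_english = re.match(FILENAME_METADATA, 'index.en').groupdict()


def lang_for(base):
    """Hypothetical fallback: treat an empty lang as 'not set'."""
    groups = re.match(FILENAME_METADATA, base).groupdict()
    return groups['lang'] or DEFAULT_LANG


print(raw_default, raw_english)
print(lang_for('index'), lang_for('index.en'), lang_for('index.de'))
```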

@avaris
Member

avaris commented Aug 29, 2014

We can skip setting empty/None groups for FILENAME_METADATA/PATH_METADATA. That's fairly trivial. I guess there also needs to be a definite, documented order in which metadata sources override each other. For example:

From least important to most important:

  • DEFAULT_METADATA
  • Other defaults defined in the settings (such as AUTHOR, DEFAULT_CATEGORY, DEFAULT_LANG) [*]
  • USE_FOLDER_AS_CATEGORY
  • PATH_METADATA
  • FILENAME_METADATA
  • explicit metadata in the file

Any thoughts? cc: @justinmayer @ametaireau @smartass101

[*] Also on a related note: we might consider deprecating these and go with the single setting DEFAULT_METADATA. It'll clean up the settings a bit.
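A rough sketch of how such a priority order could be applied, combined with skipping empty/None values. This is an illustration of the proposal, not Pelican's implementation; `merge_metadata` and the sample dictionaries are hypothetical.

```python
def merge_metadata(*sources):
    """Merge metadata dicts given from least to most important.

    Later sources win, but None and empty-string values never
    override a value that an earlier source already provided.
    """
    merged = {}
    for source in sources:
        for key, value in source.items():
            if value is None or value == '':
                continue  # unset values must not clobber earlier ones
            merged[key] = value
    return merged


# Sample inputs, ordered as in the list above (all values made up).
defaults = {'category': 'misc', 'lang': 'en'}
folder = {'category': 'words'}
path_meta = {'category': 'words', 'slug': None}
filename_meta = {'lang': '', 'title': 'Another test'}
in_file = {'title': 'Another test, revised'}

result = merge_metadata(defaults, folder, path_meta, filename_meta, in_file)
print(result)
```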

@foresto
Contributor

foresto commented Nov 17, 2014

I've been avoiding the problem of regex named groups being set to None by using an optional group within the named group, like this:

PATH_METADATA = '(?P<dirpath>(.+/)?).*'

Notice how the matching pattern is within that inner group? If no text matches, the named group will still have a value: an empty string.

My workaround means making regular expressions a bit more complex than they should have to be, though, and it isn't exactly obvious. It would be nice if Pelican did something more intuitive here. Maybe replacing None values with empty strings would be good (though I haven't given much thought to possible consequences of such a change).
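The difference between the two group styles can be seen with `re` directly. In this sketch `WRAPPED` is the pattern from the comment above, while `PLAIN` is a hypothetical variant without the inner group, added only for comparison.

```python
import re

PLAIN = r'(?P<dirpath>.+/)?.*'      # hypothetical: the named group itself is optional
WRAPPED = r'(?P<dirpath>(.+/)?).*'  # from the comment: optional part sits inside

# No directory component: the optional named group does not participate.
plain_value = re.match(PLAIN, 'file.md').group('dirpath')

# The named group always participates, matching '' when there is no directory.
wrapped_value = re.match(WRAPPED, 'file.md').group('dirpath')
wrapped_nested = re.match(WRAPPED, 'a/b/file.md').group('dirpath')

print(repr(plain_value), repr(wrapped_value), repr(wrapped_nested))
```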

@naturallymitchell
Contributor

Slug can now come from the filename with SLUGIFY_SOURCE = 'basename'. Lang in the basename is probably incompatible with that. Perhaps multiple, mergeable content directories might work better, e.g. en_content and fr_content.

@foresto
Contributor

foresto commented Nov 18, 2014

This issue report comprises several problems:

  1. Pelican's FILENAME_METADATA / PATH_METADATA regex matching uses Python's default value of None for named groups. It should probably use '' instead. The change would be trivial: just make parse_path_metadata() call match.groupdict('') instead of match.groupdict(). It would also let me write simpler regexes than the ones I've been crafting to work around the problem.
  2. Pelican allows metadata with empty values to override previously-defined metadata. This is also the cause of issues #1398 ("Tag with empty name created for line tags:") and #1469 ("Empty slug causes generation of hidden file"). Pull request #1491 ("Ignore empty metadata. Fixes #1469. Fixes #1398.") fixes it.
  3. As @avaris pointed out, the USE_FOLDER_AS_CATEGORY problem is probably just a matter of defining (and documenting) clear priorities for each source of metadata.
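Points 1 and 2 can be sketched with the standard `Match.groupdict(default)` call. The shortened pattern and the filtering step below are illustrative assumptions, not the code in `parse_path_metadata()` or in the pull request.

```python
import re

# Shortened, illustrative pattern: an optional date followed by a title.
PATTERN = r'(?P<date>\d{4}-\d{2}-\d{2})?[ _-]?(?P<title>[^/.]+)'

match = re.match(PATTERN, 'Whatever')

with_none = match.groupdict()     # default: unmatched groups are None
with_empty = match.groupdict('')  # point 1: unmatched groups become ''

# Point 2: ignore empty values so they cannot override earlier metadata.
non_empty = {key: value for key, value in with_empty.items() if value}

print(with_none)
print(with_empty)
print(non_empty)
```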

@jpli
Contributor

jpli commented Sep 19, 2016

I'm trying to fix the USE_FOLDER_AS_CATEGORY problem (commit 0f6b985).

@bberberov
Contributor

I have a question about the behavior of USE_FOLDER_AS_CATEGORY.

Given the following setup:

pelicanconf.py

USE_FOLDER_AS_CATEGORY = True
PATH = 'content'
ARTICLE_PATHS = ['articles']
PAGE_PATHS = ['pages']

Any articles in articles would get that as a category, and pages in pages would get that as a category, if not set otherwise.

If I changed it to:

pelicanconf.py

USE_FOLDER_AS_CATEGORY = True
PATH = 'content'
ARTICLE_PATHS = ['article_src_1', 'article_src_2']
PAGE_PATHS = ['page_src_1', 'page_src_2']

Then articles and pages in the two different folders (e.g. article_src_1 and article_src_2) would get different categories, even though they are both in what I would consider "top-level" folders. Is this the desired behavior, and if we wanted something different in the generated content, we should handle it at the theme level? As an alternative, what about using DEFAULT_CATEGORY for all articles and pages in top-level folders, when the category is not set by other means?

@avaris
Member

avaris commented Mar 17, 2019

@bberberov Yes, that is the intended behavior. USE_FOLDER_AS_CATEGORY will use the immediate folder name for category (path relative to content root).

articles/file.md  --> category='articles'
articles/foo/file.md --> category='foo'
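The "immediate folder name" rule can be sketched as follows. This is an approximation of the described behavior, not a copy of Pelican's code; `folder_category` is a hypothetical helper and paths are assumed to be relative to the content root with forward slashes.

```python
import posixpath


def folder_category(relative_path):
    """Hypothetical helper: name of the folder directly containing the file."""
    return posixpath.basename(posixpath.dirname(relative_path))


print(folder_category('articles/file.md'))      # articles
print(folder_category('articles/foo/file.md'))  # foo
print(repr(folder_category('file.md')))         # '' when directly in the content root
```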

One way of doing what you want would be utilizing PATH_METADATA:

PATH_METADATA = '.*/(?P<category>[^/]+)/[^/]+'
DEFAULT_CATEGORY = 'default'
USE_FOLDER_AS_CATEGORY = False
ARTICLE_PATHS = ['article_src_1', 'article_src_2']

then it should assign categories for articles immediately inside article_src_1 or article_src_2 as default and articles in subfolders should get their category from folder names. (assuming my regex is correct :) ). i.e.:

article_src_1/article.md  --> category='default'
article_src_2/article.md  --> category='default'
article_src_1/foo/article.md  --> category='foo'
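The regex can be checked outside Pelican. The sketch assumes the pattern is applied with `re.match` to the path relative to the content root; `category_for` is a hypothetical helper that falls back to DEFAULT_CATEGORY when nothing matches.

```python
import re

PATH_METADATA = '.*/(?P<category>[^/]+)/[^/]+'
DEFAULT_CATEGORY = 'default'


def category_for(relative_path):
    """Hypothetical helper: category from the path, else the default."""
    match = re.match(PATH_METADATA, relative_path)
    return match.group('category') if match else DEFAULT_CATEGORY


for path in ('article_src_1/article.md',
             'article_src_2/article.md',
             'article_src_1/foo/article.md'):
    print(path, '-->', category_for(path))
```

A file directly inside article_src_1 has only one slash in its path, so the pattern (which needs two) does not match and the default applies.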

EDIT:

As an alternative, what about using DEFAULT_CATEGORY for all articles and pages in top-level folders, when the category is not set by other means?

That is what DEFAULT_CATEGORY already does. But if you set USE_FOLDER_AS_CATEGORY, you'll pretty much always have a category unless you put stuff directly inside content folder.

@bberberov
Contributor

@avaris I see. The regex is correct, by the way.

I'll just use DEFAULT_CATEGORY to compare with inside my theme and the user will have to worry about setting up the metadata correctly. Thanks.

@justinmayer
Member

Given the lack of activity regarding this topic, I think it's best to close this issue for now. If someone wants to implement a solution and discuss it further, please post a comment here and we can resume discussion about it.
