Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support compound file extensions. Fix #2780 #2816

Closed
wants to merge 1 commit into from
Closed

Support compound file extensions. Fix #2780 #2816

wants to merge 1 commit into from

Conversation

holden-nelson
Copy link

@holden-nelson holden-nelson commented Oct 29, 2020

Before this commit, the file extension matcher could not match compound
file extensions; e.g., '.md.html' would match .html

Pull Request Checklist

Resolves: #2780

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code
  • Updated documentation for changed code

Before this commit, the file extension matcher could not match compound
file extensions; e.g., '.md.html' would match `.html`
@avaris
Copy link
Member

avaris commented Oct 29, 2020

Some problems:

  • Pretty sure this will fail to include some.file.with.dots.rst
  • You don't actually handle reader selection logic (inside readers.py:Readers.read_file). foo.md.html would already be selected with previous code, if a reader that handles .html was enabled. The issue is that it was triggering HTMLReader instead of a reader specific to md.html. And same still happens.

@holden-nelson
Copy link
Author

I see.

I'll admit, and this is probably obvious, but I'm kind of new to open source submissions and all that, so I probably jumped the gun by making this change without a better holistic view of how Pelican works. But I'm very determined to contribute, so I will spend more time getting acquainted, maybe write a plugin, and then revisit this topic.

Thanks!

@avaris
Copy link
Member

avaris commented Oct 30, 2020

That's all right and we appreciate the contribution :). If you need any help along the way, feel free to ask.

@justinmayer
Copy link
Member

Hi @holden-nelson. Just wanted to check in regarding the status of this PR. Do you intend to keep working on it?

@holden-nelson
Copy link
Author

Hello @justinmayer. I had kind of put this on the backburner as other projects and real life happened, but ultimately I would like to do this. I'm finally wrapping up some of that other stuff, but I wouldn't expect to see any progress on this, on my end, for another couple of months. Let me know how that works on your end, I understand if you'd like to close it and move on.

Thanks!

@holden-nelson
Copy link
Author

Hello @justinmayer and @avaris. I'm ready to make this contribution, but I need to know how to handle the situation regarding a filename with multiple dots, like some.file.with.dots.rst.

Right now, extensions are stripped out of file names with os.path.splitext(basename)[1][1:] which will regard everything after the last dot, if there is a dot, as the extension. My thought was to use Pathlib.Path.suffixes() to grab everything that appears after the first dot and join it together to get an extension, like in my first pull request. But then a file with dots in its name would have the wrong extension and wouldn't be properly read.

I imagine it would come down to either

  • Closing the issue and not allowing compound file extensions.
  • Having a setting ALLOW_COMPOUND_FILE_EXTENSIONS=False with a big warning that if you set it equal to True you can't have file names with dots anywhere in your article path or it will break. And then in the code we can account for both scenarios.

Or is there some other way to allow arbitrary compound file extensions and dots in file names? What are your thoughts on this?

@avaris
Copy link
Member

avaris commented Dec 28, 2021

File with dots can be handled if we use a_string.endswith(extensions). The real challenge is selecting the appropriate reader. e.g. foo.md.html should trigger a reader with .md.html extension rather than .html one. I'm not sure how to handle that nicely (yet).

@holden-nelson
Copy link
Author

Forgive me if this comment shows ignorance of the Pelican software.

I see how a_string.endswith() could be used to check for a particular file extension, but I don't see how it's useful for an arbitrary compound one if we still allow dots in file names. I guess what I'm saying is, the way I see it,

  • if we're only trying to add support for md.html files specifically, we could add a check in Generators.generate_context() for f.endswith('.md.html') . If True, we call Readers.read_file() on it with the fmt=md.html parameter. Otherwise we call it without the fmt parameter. We'd need a quick check in Generators._include_path() well. I don't know that this is the most clean way, especially if down the road we wanted to add support for more compound extensions, but conceptually something like this.
  • If we want to add support for arbitrary compound extensions and dropped support for filenames with dots we should be able to just switch to Pathlib.Path.suffixes() and be good.
  • I don't understand how it's possible to allow both arbitrary compound extensions and dots in filenames. Like beyond Python specifically I don't understand how it's logically possible.

@avaris
Copy link
Member

avaris commented Jan 2, 2022

If we are going to support this, it should be generic and not pinned down to a specific extension. Also I don't see the particular issue with filenames with dots. It's a pattern matching problem. We are trying to match the longest possible extension that matches available extensions. So a file like file.with.dots.md.html can be considered as one of the following possibilities:

file.with.dots.md + .html
file.with.dots + .md.html
file.with + .dots.md.html
file + .with.dots.md.html

Now the problem is which of these combinations is compatible with available readers. And the challenging part would be selecting the longest matching extension from available readers. One possible (and simple) option is "ranking the extensions and their readers" (say, based on the number of dots) and going them one by one starting from the highest until a match is found and that would be the preferred reader. But this would generally be slower than current approach, which is extract the last extension and check if it is in a dict...

Another approach is building a "tree of readers" based on the extensions and each level of compound extension makes it one more level deep. Something like

readers = {
    '.html': {
        '': HTMLReader,
        '.md': MDHTMLReader,
    },
    # ...
}

So, once .html is matched, it looks for the next extension and so on until it can't match anymore. You can think of it like working backwards from pathlib.Path.suffixes. This is quite a bit more involved but would perform a lot better.

@holden-nelson
Copy link
Author

Ahh. Okay, now I see how that would work. Thanks for taking the time to explain.

I'll continue working on this feature and submit a PR soon.

holden-nelson added a commit to holden-nelson/pelican that referenced this pull request Jan 7, 2022
Created a new class ReaderTree that is an infinitely
nested defaultdict containing components of the extension.
See comments on PR getpelican#2816.
holden-nelson added a commit to holden-nelson/pelican that referenced this pull request Jan 7, 2022
Created a new class ReaderTree that is an infinitely
nested defaultdict containing components of the extension.
See comments on PR getpelican#2816.
holden-nelson added a commit to holden-nelson/pelican that referenced this pull request Jan 7, 2022
Created a new class ReaderTree that is an infinitely
nested defaultdict containing components of the extension.
See comments on PR getpelican#2816.
holden-nelson added a commit to holden-nelson/pelican that referenced this pull request Mar 14, 2022
Created a new class ReaderTree that is an infinitely
nested defaultdict containing components of the extension.
See comments on PR getpelican#2816.
@justinmayer justinmayer requested a review from avaris June 24, 2023 09:45
@justinmayer justinmayer deleted the branch getpelican:master June 25, 2024 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support compound file extensions
3 participants