Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve markdown compiler performance #2661

Merged
merged 5 commits into from Feb 16, 2017
Merged

Improve markdown compiler performance #2661

merged 5 commits into from Feb 16, 2017

Conversation

@living180
Copy link
Contributor

@living180 living180 commented Feb 14, 2017

This PR fixes issue #2660.

The first commit eliminates pathological performance when compiling a large number of markdown files at one time but preventing the markdown extensions list from growing each time a markdown file was compiled. On my machine this reduced the time required to compile 1000 markdown files from over 11 minutes to under 55 seconds.

The second commit provides a further incremental improvement by creating and reusing a single Markdown object instead of creating a separate Markdown object for each file that is compiled. This further reduces the compilation time for 1000 markdown files to under 34 seconds.

The CompileMarkdown.compile_string() method was always adding the
configured markdown extensions (from MARKDOWN_EXTENSIONS) to
self.extensions.  This meant that each time it was called,
self.extensions grew.  As a result, the markdown conversion grew slower
each time due to the increasing number of extensions.  Fix by adding the
configured extensions once in the set_site() method instead.
Copy link
Member

@Kwpolska Kwpolska left a comment

Any benchmarks?

Also, add this change to CHANGES.txt and your name to AUTHORS.txt.

site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
self.config_dependencies.append(str(sorted(site_extensions)))
extensions.extend(site_extensions)
self.markdown = Markdown(extensions=extensions, output_format="html5")

This comment has been minimized.

@Kwpolska

Kwpolska Feb 14, 2017
Member

This will crash if Markdown does not exist.

This comment has been minimized.

@living180

living180 Feb 15, 2017
Author Contributor

Good catch, thanks.

plugin_info.plugin_object.short_help = plugin_info.description

site_extensions = self.site.config.get("MARKDOWN_EXTENSIONS")
self.config_dependencies.append(str(sorted(site_extensions)))
self.extensions.extend(site_extensions)
extensions.extend(site_extensions)
self.markdown = Markdown(extensions=extensions, output_format="html5")

This comment has been minimized.

@felixfontein

felixfontein Feb 14, 2017
Contributor

Are Markdown objects thread-safe? (You can tell doit to process tasks in parallel using threads, so it could be that two Markdown posts are compiled simultaneously.) I somehow doubt this as you call self.markdown.reset() after self.markdown.convert(data). If this is the case, you should better use thread-local storage for the Markdown object.
(That would also indirectly solve Kwpolska's remark.)

@ralsina
Copy link
Member

@ralsina ralsina commented Feb 14, 2017

I agree, either make it thread local or protect it with a mutex.

@felixfontein
Copy link
Contributor

@felixfontein felixfontein commented Feb 14, 2017

Since this is about performance, making it thread-local sounds like a much better choice than a mutex :)

@living180
Copy link
Contributor Author

@living180 living180 commented Feb 15, 2017

@Kwpolska wrote

Any benchmarks?

In the PR description, I included some timings I measured (basically more than 11 minutes before my changes, then under 55 seconds with just the first commit, and finally under 34 seconds with both commits). Are you looking for something more than that? If so, could you please describe what you would like to see? I'm happy to try to accomodate

@Kwpolska
Copy link
Member

@Kwpolska Kwpolska commented Feb 15, 2017

@living180 Ah, that’s good enough. But please address all our comments so we can proceed with this.

living180 added 2 commits Feb 14, 2017
Stop using the markdown() convenience function in the
CompileMarkdown.compile_string() method, which creates a new Markdown
object each time it is called.

Instead, create a new class, ThreadLocalMarkdown, which is a subclass of
threading.local.  This class has a single attribute, markdown, which is
a per-thread Markdown instance.  It also has a single method, convert(),
which converts the data using the Markdown instance's convert() method,
and then resets the Markdown instance's internal state.

Then in the CompileMarkdown.set_site() method, instantiate a new
ThreadLocalMarkdown instance (unless the markdown module is unavailable)
and assign it to the plugin's converter attribute.  In the
compile_string() method, use that ThreadLocalMarkdown instance to
perform the conversion.

The end result is one Markdown object created per thread, instead of one
for each markdown file compiled.
@living180
Copy link
Contributor Author

@living180 living180 commented Feb 15, 2017

Updated version - should not fail ungracefully if the markdown module is not installed, and is now thread-safe for parallel builds. (Testing with nikola build -n 4 reduces the runtime to 14s on my machine.)

"""Convert to markdown using per-thread Markdown objects."""

def __init__(self, extensions):
self.markdown = Markdown(extensions=extensions, output_format="html5")

This comment has been minimized.

@felixfontein

felixfontein Feb 15, 2017
Contributor

I'm not sure whether this is the correct way to use threading.local. Every thread creates a copy of this initial value, but depending on how this copy is created (maybe even none at all, by just copying the reference to the Markdown object) the resulting object is not thread-safe.

It might be better to set self.markdown to None, and to check self.markdown in convert() and initialize it there when its current value is None.

This comment has been minimized.

@felixfontein

felixfontein Feb 15, 2017
Contributor

Ok, after reading some more about this, I realize I'm wrong and your approach is correct. threading.local is a really strange beast, it seems. (For anyone interested, it seems the only "real" documentation can be found in the docstring here: https://svn.python.org/projects/python/trunk/Lib/_threading_local.py.)

This comment has been minimized.

@Kwpolska

Kwpolska Feb 15, 2017
Member

Here’s some test code:

import threading
from markdown import Markdown

class ThreadLocalMarkdown(threading.local):
    """Convert to markdown using per-thread Markdown objects."""

    def __init__(self, extensions):
        self.markdown = Markdown(extensions=extensions, output_format="html5")
        self.counter = 0

    def convert(self, data):
        """Convert data to HTML and reset internal state."""
        result = self.markdown.convert(data)
        self.markdown.reset()
        self.counter += 1
        print(self.counter)
        return result

    def show(self):
        print(self.counter)

converter = ThreadLocalMarkdown([])

def run_convert():
    converter.convert("Hello, **world!**")

threads = [threading.Thread(None, run_convert) for i in range(10)]
[t.start() for t in threads]
converter.show()

says a bunch of ones and a zero at the end, as expected.

This comment has been minimized.

@felixfontein

felixfontein Feb 15, 2017
Contributor

With an integer it wouldn't be a problem anyway, because integers are immutable. If what I first thought would happen would really happen, you'd need a list or some other mutable object to test for it.

After reading the docstring for _threading_local, I'm happy with the code. I still think that threading.local viloates the principle of least astonishment.

This comment has been minimized.

@Kwpolska

Kwpolska Feb 16, 2017
Member

I’m taking this as an approval and merging.

This comment has been minimized.

@felixfontein

felixfontein Feb 16, 2017
Contributor

Yep, I'm happy with this PR. I'm just not happy how threading.local was designed ;)

@Kwpolska Kwpolska requested a review from felixfontein Feb 15, 2017
@Kwpolska Kwpolska removed the request for review from felixfontein Feb 16, 2017
@Kwpolska Kwpolska merged commit a50f8eb into getnikola:master Feb 16, 2017
2 of 3 checks passed
2 of 3 checks passed
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
codacy/pr Good work! A positive pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@Kwpolska
Copy link
Member

@Kwpolska Kwpolska commented Feb 16, 2017

Thanks for contributing to Nikola! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants