This repository has been archived by the owner on May 19, 2020. It is now read-only.

Feature: Store both Markdown and rendered HTML when saving pages and posts #744

Closed
james2doyle opened this issue Oct 13, 2014 · 28 comments

@james2doyle
Contributor

I was looking through the Markdown class yesterday, wondering if Parsedown could replace it, and I noticed that the database only saves the Markdown content for posts and pages.

I was wondering why is this? Wouldn't it be more efficient to save the markdown and the HTML into the database instead of having a "render content as HTML" function for the frontend? The markdown for a page is probably modified much less than the number of times the HTML is viewed by site visitors.

The article_html function could be reduced to the same weight as reading any other static property of a page or post.

I can tackle this feature if there is an actual need for it. I realize adding Parsedown would be a different conversation.

@james2doyle
Contributor Author

Did you want me to tackle this feature @CraigChilds94?

@daviddarnes
Member

I don't know what the implications of this are, but wouldn't it add unnecessary complications to the system? How much faster would the page load time be? Page load seems pretty fast at the moment. On the other hand, it would be cool to have the option to switch out different types of Markdown.

@CraigChilds94
Member

I think what you're ideally suggesting is a caching system which is refreshed upon edit?

So any rendered HTML is cached somewhere, but as soon as the Markdown for that output is modified, we reparse it?

@james2doyle
Contributor Author

This wouldn't add "complications". It would just move the parse function out of functions.php for the frontend and into the save function in the pages/posts model. Then there would be one additional column added to the pages and posts tables. That's really it.
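
The shape of that change can be sketched roughly like this (a minimal illustration, not Anchor's actual code; `parse_markdown` and the array shape are stand-ins for Anchor's parser and model):

```php
<?php
// Toy renderer standing in for Anchor's Markdown parse() function.
function parse_markdown(string $md): string {
    return preg_replace('/^# (.+)$/m', '<h1>$1</h1>', $md);
}

// Render once at save time and keep both representations, so the
// frontend can read the pre-rendered HTML column directly.
function save_page(array $page): array {
    $page['html'] = parse_markdown($page['markdown']);
    return $page;
}
```

Reading a page on the frontend then becomes a plain column read instead of a parse on every request.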

It may seem frivolous, but why not do it properly and make it as fast as it can be? Page speed affects user experience, as well as Google page rankings.

I think I might do a few little tests/benchmarks to see how much of a difference it really makes.

@CraigChilds94
Member

That'd be a great idea; we can't really tell until we have some evidence. Other than that, maybe try implementing simple caching using files or memcache or something?

@james2doyle
Contributor Author

Caching would be great actually.

This suggestion is actually much simpler, and less additional code, than adding a cache. It's moving a couple of functions and adding a database column.

As far as I understand, memcached is for caching database queries. The issue I'm pointing out here would not be fixed with extra caching. I'm talking about reducing the amount of string and data manipulation after the database has been queried.

@CraigChilds94
Member

I understand what you mean, but my suggestion could happen at the same time and would also reduce the number of database queries. I think the best thing to do is some performance testing. If you provide us with some results, that'd be great!

It would also be nice to take into account the extra storage required in the database. Let's say the max size of a post in Markdown is 16MB; the HTML field would have to be larger than that, as the Markdown will produce a larger output than its input. Given that a post/page is a MEDIUMTEXT type, the HTML field would have to be a LONGTEXT to accommodate the generated tags.

So for any post you could be storing ~40MB in the database. Let's say you have 10 posts... that's 400MB, whereas if you're only storing the Markdown in the DB you'd only be using a max of 160MB for those 10 posts.

This is just something you'll need to think about...

@james2doyle
Contributor Author

I don't think we will ever have to worry about anything like that. The whole bible fits in a 4.5MB text file.

If someone had a 16MB Markdown/HTML entry, that would be 16,000,000 characters, or about 8,000 pages of text.

I'm not trying to be condescending or anything, but every other reputable CMS stores HTML in their database.

@daviddarnes
Member

Is the whole bible in rich content or plain text? ;)

I know that WordPress does store the HTML in the database, but is it a good idea? This I'm completely unsure of, both your angles have benefits. I truly think some example tests would be beneficial to this.

@james2doyle thanks for your efforts on this, we know you're just trying to make Anchor better and that's awesome :).

@james2doyle
Contributor Author

Yeah, it was plain text. The rich text version is 6.5MB. I didn't have this information on hand, but I knew that 16MB was a lot of text, so I looked up the "size of the bible in plain text". I found it on this site. The files are all ZIPs, so if you want to know the real size, you need to download and unzip them.

They actually have a lot of different formats so if you were interested in the size differences in text formats then this is a pretty good place to find examples.

@james2doyle
Contributor Author

Results!

This is the Markdown file I used to test with.

I've used it a lot in the past to theme and style Markdown-rendered content, since it contains all the elements. I realize that Anchor does not parse all the tags in this file, but that is fine.

I copied the resulting HTML from the output Anchor rendered and pasted that into the content for the post.

Here are the specs for my computer:

  • 2011 MacBook Pro (128GB SSD, 2.3 GHz Intel Core i5, 8 GB RAM)
  • PHP 5.5.16
  • MySQL (I use MariaDB, Ver 15.1 Distrib 10.0.12-MariaDB, for osx10.9 (i386) using readline 5.1)
  • Apache/2.4.9

Ok so here is my little bench code. I modified articles.php with the following:

```php
function article_markdown() {
    $time_start = microtime(true);
    $content = parse(Registry::prop('article', 'html'));
    $time_end = microtime(true);
    $time = $time_end - $time_start;
    return "<p>$time</p>" . $content;
}
```

So we take a timestamp at the start of the function and another at the end, then prepend the result to the returned HTML. Simple! When parsing the raw database HTML, I add false as the last argument to the parse function.

Here are the results:

| Content Test Type | Average Lowest Time | Average Highest Time | Average | Average Difference |
| --- | --- | --- | --- | --- |
| Markdown (Default Anchor) | 0.0073 | 0.012 | 0.0097 | 0% (Baseline) |
| HTML parsed as Markdown | 0.0058 | 0.0086 | 0.0072 | +25.77% |
| HTML (Saved In Database) | 0.00016 | 0.00023 | 0.0002 | +97.94% |

I didn't expect these kinds of results to be honest. I thought it would only be like ~20% faster or maybe 50% max. This is almost twice as fast.

We should keep in mind the specs for my computer are not the same as the typical server. I would love to see what kind of numbers you guys get on average.

@CraigChilds94
Member

@daviddarnes what do you think of this?

@james2doyle Can I ask how many times you ran the tests? Also, just to clarify: the above code is for testing the parsing of the HTML? And with the true parameter it is not parsing, just returning the HTML?

These are promising results as far as speed is concerned :) thanks for taking the time to research this!

@jakecleary
Contributor

👍 for storing the HTML.

@daviddarnes
Member

Well, can't disagree with the numbers! It's more your call @CraigChilds94; if you think it's beneficial and not too much dev work, then go for it. I guess in simple terms we're shifting the load time from the viewer to the content manager, because the admin will be loading the Markdown from the HTML in the DB (is that correct?).

@james2doyle
Contributor Author

I ran each test about 20 or so times, since I was trying to get the most honest low and high times. Then I just took the average and calculated the percentage change.

I really think someone else should run this on a "real" server. Like a GoDaddy or a DigitalOcean. Then we can get some more results. It is hard to argue with the preliminary results though.
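
For reference, the percentage changes in the table above can be reproduced from the averages like this (a quick sketch; `percent_faster` is just an illustrative helper, not part of the test code):

```php
<?php
// Improvement of a candidate average over the baseline average, in percent.
function percent_faster(float $baseline, float $candidate): float {
    return round(($baseline - $candidate) / $baseline * 100, 2);
}

echo percent_faster(0.0097, 0.0072), "\n"; // HTML parsed as Markdown: 25.77
echo percent_faster(0.0097, 0.0002), "\n"; // HTML saved in database: 97.94
```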

@CSCG

CSCG commented Oct 21, 2014

Contact me directly if you need an FTP account on a DigitalOcean Droplet or a Dreamhost node. I've got some spare space on both, as well as a California-based VPS, I believe. I'd be happy to help with the testing... these results pique my interest both within and outside this project.

@ghost

ghost commented Oct 23, 2014

Anchor used to store the HTML in the database. No reason why it can't again, those are some impressive numbers.

@james2doyle
Contributor Author

I just tested this on my DigitalOcean droplet (512MB, 20GB) and the results were pretty much the same.

The average default time was 0.01, when reading HTML from the DB it was 0.0002 average.

If everyone is in agreement, I can submit a new PR. Aye?

@systimotic

Aye aye captain Doyle! I'd agree with it!

Also, why did Anchor stop storing the HTML? We might need to take that into consideration when putting it back in, depending on the reason.

@CraigChilds94
Member

Probably because it wasn't using Markdown? I can't remember! Go ahead and create a PR and I'll check it out :)

@systimotic

Yeah... That would make sense 😄

@meskarune

I know this issue already has a pull request, but I don't agree that storing both HTML and Markdown is the way to go. It is very redundant (essentially you are storing the same data in two formats) and sort of invalidates the whole reason for using/storing Markdown in the first place. (If you store HTML, why even have Markdown in the database?)

The real solution to speed up anchor and reduce database hits is to use caching.

You can configure an opcode php cache easily and then drop varnish onto your web server stack.

Opcode caching in PHP 5.5:

Edit php.ini and add/uncomment the following:

```ini
zend_extension=opcache.so
opcache.memory_consumption=512
opcache.max_accelerated_files=50000
```

For older versions of PHP, you can just as easily use APC.

Quick varnish tutorial for Apache: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-varnish-with-apache-on-ubuntu-12-04--3

There are of course many other caching solutions available; I just wanted to show that implementing this isn't too complicated, and really should be standard if you host lots of websites.

I will grant that for someone who has a single Anchor site, installing Varnish could be seen as excessive/unnecessary. For this type of user, the ideal solution would be for Anchor to implement its own cache system, or to save content as HTML in the database.

So I guess the real question is this: does Anchor plan to implement caching sometime in the future?

This could either be in the main code base, or implemented as a plugin so users can choose which caching solution they want to use. If Anchor will have its own caching solution, it doesn't make sense to save HTML along with Markdown in the database long-term.

If Anchor won't be implementing a caching solution, would you want/expect your users to configure their own? We could just add a recommendation along with information/links in the documentation for setting up a caching proxy and opcode cache. I wouldn't mind helping to write some docs on optimizing your server for the best Anchor performance.

If you expect most users to use Anchor without caching, and don't plan to implement caching, then saving content as html for faster page rendering makes sense. If you go this route, is there a way to remove markdown from the database so you aren't storing the same information twice?

I know you could just store both Markdown and HTML, but this really bothers me. What if I have 2000 Anchor installs on a host for various clients? I am now storing the same information twice for each site. Just thinking in terms of scale, this seems very inefficient.

ps: +1 for parsedown

@james2doyle
Contributor Author

I think the main thing not to get caught up on is that Markdown is being used for its ability to abstract HTML, not to reduce data size. If every client (as in customer/end user) could write HTML and create perfect blockquotes/links/image tags etc., then Markdown wouldn't exist and we could stick with a single "HTML" column.

I wish there was a cache in Anchor; that would make this PR less relevant. There really isn't one at the moment, and not every host provides the tools to install extensions or tweak existing ones. So things like APC and OPcache would be great, but sometimes you don't have that luxury. This is the "cheapest" way to gain a ~98% performance increase without breaking backwards compatibility or adding to/rewriting the core classes.

What if I have 2000 Anchor installs on a host for various clients.

There is always going to be a case of "but what if I have a [X] situation with a [Y] setup?". The goal of Anchor seems to be shooting for the highest server coverage and best host support. If you are at the scale where you need to be running 2000 instances of something, I would suggest a system that already has things like caching and multi-site support.

I am now storing the same information twice for each site

Not quite. That data is not really the same. We are storing the "uncompiled" raw Markdown, and then we are storing the "compiled" result. This data is not identical. Before, the site was ~98% slower, so there is a significant return on that additional data. We can argue the cost of SQL table columns all day. Why store the slug when you can just slugify the title column when the page data is pulled?

The point is to reduce functions and string manipulations that happen each time a page is called. It's the same thing opcache and APC are trying to do. They try to reduce the number of times a function is called when the results are going to be the same.
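
The "compute once, reuse the result" idea described here can be sketched as a simple memoized render (an illustration only, not Anchor code; `cached_render` is a hypothetical helper):

```php
<?php
// Memoize an expensive render so repeated calls with the same input
// skip the work entirely, analogous to what OPcache/APC do for code.
function cached_render(string $markdown, callable $render): string {
    static $cache = [];
    $key = md5($markdown);
    if (!array_key_exists($key, $cache)) {
        $cache[$key] = $render($markdown); // expensive parse runs only once
    }
    return $cache[$key];
}
```

Storing the rendered HTML in the database is the persistent version of the same trick: the "cache" survives across requests.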

This approach seems to be a pretty common tactic in the world of markdown-supported CMSs. It really is a cheap way to squeak a little more performance out of the site, without too much of an overhaul.

@lifehome

@james2doyle

Sorry for pushing this up a bit while the PR is almost ready to go, but I'm still confused after reading this pile of comments and the PR commits.

(Please correct me if I'm wrong)

So this feature is going to store two things: "raw markdown" and "compiled html"
And this is supposedly to increase the speed on rendering the client-side page loadup.

If so, just being curious: why not create an encode/decode function? That is, render the HTML and store it in the database, then "decompile" it back to Markdown (or some HTML/Markdown mixture) for editing? This might slow down the backend a bit, but it's just an idea.

@james2doyle
Contributor Author

So this feature is going to store two things: "raw markdown" and "compiled html". And this is supposedly to increase the speed on rendering the client-side page loadup.

Yes, you got it.

why not create an encode/decode function? Render the HTML and store it in the database, and "decompile" it back to Markdown for editing? This might slow down the backend a bit...

Well, you basically gave the reason why: it would be slower.

The main thing is, and I've hammered this point enough, that the change in this PR is so minor, yet we get quite a nice performance boost without any sacrifices. I actually removed some parsing functions because they aren't needed anymore.

@hubply

hubply commented Mar 15, 2015

I implemented a rapid system for an Asian company handling complaints to the government. For the comment storage pattern, my team and I separate content into three stores (one index, one raw HTML, one plain text), and the system is not on MySQL or any SQL database directly (there's a lot of hardware overhead with many comments); content is stored in a cache (Redis) first and then separated into the database level. It's more complex, but it works fine for my clients. Try it.

@CraigChilds94
Member

Update:

I'm still working on getting this implemented. Last time I pulled the PR down to my local I had some issues. It might take a bit of working out, but this will be a considerable change to Anchor! Again, thanks for all of your hard work @james2doyle

@CraigChilds94 CraigChilds94 added this to the 0.9 milestone Apr 3, 2015
CraigChilds94 added a commit that referenced this issue Jun 26, 2015
Store both Markdown and rendered HTML. Closes #744.
@CraigChilds94
Member

This has now been merged in! Thanks for everyone's help on this! :D

9 participants