Discussion: Importer Speed Optimisations #590

Closed
balupton opened this Issue Aug 9, 2013 · 3 comments

Comments

@balupton
Member

balupton commented Aug 9, 2013

Importing 500 Tumblr posts takes a very long time due to the nature of the API. Currently this means that the initial generation which performs the import takes about 3 minutes before you can access your site and start hacking. We need a better way of doing this.

Here are some options.

Defer importing. We could defer the loading of Tumblr articles until later: do an initial quick generation without importing anything, then do another generation once everything is imported.
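
A rough sketch of that flow, assuming hypothetical quickGenerate, importTumblrPosts and regenerate helpers (not the actual DocPad API):

# rough sketch only: quickGenerate, importTumblrPosts and regenerate
# are hypothetical helpers, not the real DocPad API
deferredImport = (docpad, next) ->
  # 1. generate straight away without waiting for the importer
  docpad.quickGenerate (err) ->
    return next(err)  if err
    # 2. import in the background, then regenerate with the new documents
    docpad.importTumblrPosts (err) ->
      return next(err)  if err
      docpad.regenerate(next)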

Lazy importing. Another option is to defer the importing and then regenerate as we import new things. Perhaps even have the importer constantly listening for new things (that would be awesome for a pub/sub importer relationship where new documents can be created at any time, we get notified, and we import them as they come in).
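
Something along these lines could cover the pub/sub case; the fetcher, addDocumentFromData and regenerate are assumptions here, the point is just regenerating whenever a new post is published:

# sketch of the pub/sub idea; fetchNewTumblrPosts, addDocumentFromData
# and regenerate are hypothetical, not existing DocPad API
EventEmitter = require('events').EventEmitter

startLazyImporter = (docpad) ->
  feed = new EventEmitter()

  feed.on 'post', (postData) ->
    docpad.addDocumentFromData(postData)   # hypothetical helper
    docpad.regenerate (err) ->
      docpad.log('warn', err)  if err

  # a polling importer can publish into the same feed
  poll = ->
    fetchNewTumblrPosts (err, posts) ->    # hypothetical fetcher
      unless err
        feed.emit('post', post)  for post in posts
      setTimeout(poll, 60 * 1000)
  poll()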

Caching. Another option: once we've imported things, we could use writeSource to write them to disk. That will only improve load times if we combine it with deferred importing, and it still incurs the same hit overall, but it does make all of the data available on the initial generation. A further option would be to use writeSource together with a way to fetch only the missing data from Tumblr, to speed up the entire process. For this to happen we will need a writeSource: once value, so that writeSource: true is not added to the actual meta data of the written file.
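
A minimal sketch of the caching side, assuming a posts/ layout and post fields the plugin doesn't actually use today; the key detail is that the writeSource flag stays out of the written meta data:

# minimal sketch of the caching route; the posts/ layout and the post
# fields are assumptions, not what the plugin actually does today
fs = require('fs')
pathUtil = require('path')

cacheImportedPost = (srcPath, post) ->
  # the writeSource flag itself stays out of the written meta data,
  # which is what a writeSource: once value would need to guarantee
  meta = """
    ---
    title: "#{post.title}"
    date: #{post.date}
    tumblrId: #{post.id}
    ---
  """
  filePath = pathUtil.join(srcPath, 'posts', "tumblr-#{post.id}.html")
  fs.writeFileSync(filePath, meta + '\n' + post.content)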

Not sure how we could accomplish these. I'm sure it's possible, just not sure how yet, or which way is best.


@jonathanmoore

jonathanmoore commented Aug 9, 2013

Personally I would like to see the caching route, where importing the posts would just take a while the first time. We've been deeply involved with Tumblr for several years now, and it is far more common to see blogs with 5,000+ posts than with 500.

Then, after the initial archive of posts has been written, subsequent imports would only load and cache the newer posts. The only thing that would need to be thought through is a way to force the entire set of posts to be imported and cached again, which would be helpful in cases where imported posts have been deleted or edited.
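
A sketch of that incremental flow, where latestCachedDate, fetchTumblrPostsSince and cachePost are hypothetical helpers and force triggers the full re-import:

# sketch of incremental importing; latestCachedDate, fetchTumblrPostsSince
# and cachePost are hypothetical helpers, not existing plugin API
importIncrementally = (docpad, opts, next) ->
  opts or= {}
  # force a full re-import when posts may have been deleted or edited
  since = if opts.force then null else latestCachedDate(docpad)
  fetchTumblrPostsSince since, (err, posts) ->
    return next(err)  if err
    cachePost(docpad, post)  for post in posts
    next(null, posts.length)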

@ghost assigned balupton on Sep 7, 2013

@pflannery

Contributor

pflannery commented Nov 3, 2013

It would be nice to have a command-line switch like:

# run all importers
docpad import

# run individually
docpad import tumblr

When running in this mode, the plugin could write the imported data to the documents dir (or wherever the user specifies).

We could also provide a date from which to retrieve the posts, i.e. anything before that date is ignored and anything >= that date is imported. That way we don't re-import data.

This is how it could look from the command line:

docpad import tumblr from={date-time}

# have friendly date constants like
docpad import tumblr from=today
docpad import tumblr from=yesterday
docpad import tumblr from=lastweek (and so on....)
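
The friendly constants could resolve to a cutoff date along these lines (just a sketch, none of this is implemented):

# sketch of resolving the friendly from= constants to a cutoff date
resolveFromDate = (value) ->
  now = new Date()
  day = 24 * 60 * 60 * 1000
  switch value
    when 'today'     then new Date(now.getTime() - (now.getTime() % day))
    when 'yesterday' then new Date(now.getTime() - day)
    when 'lastweek'  then new Date(now.getTime() - 7 * day)
    else new Date(value)  # fall back to an explicit date-time string

# anything before the cutoff is ignored, so nothing is re-imported
shouldImport = (post, fromDate) -> not fromDate or new Date(post.date) >= fromDate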

I think this would also mean we need to be able to define a plugin type, i.e.:

name: 'tumblr'
type: 'importer'
config: []
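
For example, a hypothetical importer plugin could declare itself like this; the type value and the importPosts hook are the proposal, not something DocPad supports today:

# hypothetical importer plugin; type: 'importer' and the importPosts hook
# are the proposal above, not options DocPad supports today
module.exports = (BasePlugin) ->
  class TumblrImporterPlugin extends BasePlugin
    name: 'tumblr'
    type: 'importer'   # proposed: lets `docpad import tumblr` find the plugin
    config:
      from: null       # optional cutoff date passed in from the CLI

    # hook the CLI would call for `docpad import tumblr`
    importPosts: (opts, next) ->
      # fetch posts newer than @config.from and write them into the src dir
      next()
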
@balupton

Member

balupton commented Jun 18, 2016

As per https://discuss.bevry.me/t/deprecating-in-memory-docpad-importers-exporters/87 this issue is now outside the scope of DocPad.
