Introduction of `raw` directory that is not parsed #276

Closed
Delapouite opened this Issue Jul 24, 2012 · 38 comments

Delapouite commented Jul 24, 2012

Update: See the raw plugin.

Hi

If I remember correctly, in the first versions of DocPad, only the content of the src/documents folder was processed.
The public folder (now 'files') was just copied straight to the out folder.

Now, the content of src/files is processed as well. This behavior is handy for a lot of reasons, like the attached plugin.
But it also slows down the whole generation for nothing if we don't need to do anything special with those files.

Would it be wise to bring back a specific folder in src (src/clone?) whose content is just recursively copied to out, without the funky Backbone collection adding and so on?

I know it may end up causing confusion, and maybe it only deserves a plugin that hooks into one of the final events, like serverBefore, to perform this operation.

dimitarkolev Jul 26, 2012

I vote for this. I used to have a lot of files in the files (public) folder, but since it is processed nowadays, it really slows things down: in my case from 5-6 sec to 2 min. I just created what is written above in bash with the help of rsync :) and it gets executed after DocPad has finished generating. I put all those files in a folder outside the src folder so as not to mess with the DocPad structure. It would be nice to have that kind of data/public/files/clone folder back in DocPad :)

balupton Jul 27, 2012

Member

Hrmmm, interesting.

I'm up for re-adding it, though I'd like to keep it as a power-user feature rather than something taught to everyone, as it does add more complexity to the decision-making process of "where do I put my file". I feel that even with two directories there is still too much complexity!

Perhaps we can call that directory raw as in we don't do anything special to it?

dimitarkolev Jul 30, 2012

raw is a nice name

balupton Oct 24, 2012

Member

With large files, this is definitely becoming an issue. There are a few options for implementation here:

  1. Use the operating system copy abilities - fastest, but we cannot detect which files have changed, meaning that we either copy everything over regardless of whether it already exists, or skip every file that already exists
  2. Use node.js for copying - slower, but we can detect which files have changed since last time, and copy only those that have changed
  3. Hybrid between the two - medium, use node.js for detecting file changes, however use OS for copying changed files

I'd like to get your feedback on this before I proceed.
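
A rough sketch of the change detection that options 2 and 3 rely on, in Node.js. It assumes a file counts as changed when the destination is missing, differs in size, or is older than the source. `needsCopy` and `changedFiles` are hypothetical names for illustration, not DocPad or bal-util APIs.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: decide whether `src` needs to be copied to `dest`.
// Assumption: "changed" means missing destination, different size,
// or a source modified more recently than the destination.
function needsCopy(src, dest) {
  const srcStat = fs.statSync(src);
  let destStat;
  try {
    destStat = fs.statSync(dest);
  } catch (err) {
    if (err.code === 'ENOENT') return true; // destination does not exist yet
    throw err;
  }
  return srcStat.size !== destStat.size || srcStat.mtimeMs > destStat.mtimeMs;
}

// Hypothetical helper: walk `srcDir` recursively and return the relative
// paths of all files that need copying into `destDir`.
function changedFiles(srcDir, destDir, rel = '') {
  const changed = [];
  const entries = fs.readdirSync(path.join(srcDir, rel), { withFileTypes: true });
  for (const entry of entries) {
    const relPath = path.join(rel, entry.name);
    if (entry.isDirectory()) {
      changed.push(...changedFiles(srcDir, destDir, relPath));
    } else if (entry.isFile() && needsCopy(path.join(srcDir, relPath), path.join(destDir, relPath))) {
      changed.push(relPath);
    }
  }
  return changed;
}
```

The resulting list could then be copied from node.js (option 2) or handed to an OS copy command (option 3).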

zenorocha Oct 24, 2012

I'm looking forward to this feature. I currently have a 500 MB JavaScript library in my project, and it slows down my generation process a lot.

zenorocha Oct 24, 2012

Btw, the hybrid alternative seems better.

@djalmaaraujo

+1

eduardolundgren Oct 25, 2012

I need that too

cirdes commented Nov 11, 2012

+1

balupton Nov 13, 2012

Member

Considering the interest, I'll look into getting this up in the next batch of work.

Will probably just go with option 1 at the start, as that will provide the least overhead and the quickest solution which is what the raw directory is all about.

In the meantime, you could run npm install --save bal-util, then add the following to your docpad configuration file:

    # =================================
    # DocPad Events

    events:

        # Copy over our Raw Directory when DocPad Starts Up
        docpadReady: (opts,next) ->
            # Prepare
            docpad = @docpad
            config = docpad.getConfig()
            balUtil = require('bal-util')
            # note: the trailing '*' is only expanded if the command is run through a shell
            rawPath = require('path').join(config.srcPath, 'raw', '*')

            command = ['cp', '-Rn', rawPath, config.outPath]
            docpad.log('info', 'raw directory copying over...')
            balUtil.spawn command, {output:true}, (err) ->
                return next(err)  if err
                docpad.log('info', 'raw directory copied over')
                return next()

and that should work (haven't tested the above code snippet, so it could be incorrect, but it won't be far off).

Actually, if someone wants to make that a plugin, that'll be awesome!

zenorocha Nov 14, 2012

Thanks for looking into this, I've tried to use this snippet and I got:

error: An error occured: exited with a non-zero status code Error: exited with a non-zero status code

I'll try to make it work.

balupton Nov 14, 2012

Member

To debug, try this:

            balUtil.spawn command, (err,args...) ->
                if err
                    console.log(args)
                    return next(err)

zenorocha Nov 22, 2012

I see that there's an ignorePattern attribute in docpad.coffee. Could it help here? Where can I find documentation for it?

balupton Apr 9, 2013

Member

Any chance of someone being thrilled enough to make this into a plugin? Would love it if someone could!

Hypercubed Apr 13, 2013

FYI. I have some TSV data files that are getting mangled before being copied to the out directory. I either need a raw directory as discussed or some way to prevent DocPad from converting tabs to spaces in data files. This seems like a common use case, not necessarily a power-user feature requiring a plugin, but that's just my opinion, FWIW.

Hypercubed Apr 23, 2013

It also appears that d3js JavaScript library files (https://github.com/mbostock/d3/blob/master/d3.js) are unusable after being processed from the files directory. Is it just me?

Hypercubed May 10, 2013

So, issue #491 fixed my problem with tabs in tab-delimited files. However, in some cases I still need a /raw directory for large data files. When I place a 427 MB file in /files, I get the following error:

FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory

I guess I still don't understand the technical reason DocPad is altering the /files files rather than just copying them.

balupton May 10, 2013

Member

DocPad doesn't alter files at all; it just copies them across. The issue is that DocPad does this by loading the files into memory and then writing them out, which is why we run out of memory on large projects.

We need to load them into memory as documents may want to query them. Therefore, the solution is to have a new raw directory.

Just tested the following locally and it worked great:

module.exports =

    events:

        # Copy over our raw directory
        writeAfter: (opts,next) ->
            # Prepare
            docpad = @docpad
            config = docpad.getConfig()
            balUtil = require('bal-util')
            rawPath = config.srcPath+'/raw/'
            # the trailing / indicates to cp that the files of this directory should be copied over
            # rather than the directory itself

            command = ['cp', '-Rn', rawPath, config.outPath]
            docpad.log('debug', 'Copying raw directory')
            balUtil.spawn command, {output:true}, (err) ->
                # return next(err)  if err
                docpad.log('debug', 'Copied raw directory')
                return next()

If we can move that to a plugin and add some tests, I'll be a happy man.

Hypercubed May 10, 2013

Is it possible to change the way DocPad copies the files (using streams, for example) to avoid the memory issue? I have to do something similar while processing these large files in my custom route.

I'd love to tackle making this a plugin, but it may take me some time (I'm fairly new to the whole node.js/DocPad system).

By the way, the command needs to be:

command = ['cp', '-Rn', rawPath, config.outPath+'/']

or the copy will fail if the out directory doesn't already exist.
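
For illustration, the stream-based copy asked about above might look roughly like this in Node.js. This is only a sketch: `copyFileStreamed` is a hypothetical name, and it assumes a Node.js version that ships `stream/promises`. It is not how DocPad copies files.

```javascript
const fs = require('node:fs');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: copy one file chunk by chunk, so memory use is
// bounded by the stream buffers rather than by the size of the file.
async function copyFileStreamed(src, dest) {
  // pipeline() connects the streams and rejects if either side fails
  await pipeline(fs.createReadStream(src), fs.createWriteStream(dest));
}
```

Because the file is never held in memory as a whole, a multi-hundred-megabyte data file should copy without hitting the allocation error, at the cost of the file not being available for documents to query.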

dimitarkolev May 10, 2013

I believe that simply copying the files will not work when there are thousands of files in the raw folder (we do have projects with more than 10k files). It will make the whole generation process extremely slow. It should work more like rsync, or at least do a "date modified" comparison before copying each file.

Hypercubed May 10, 2013

Actually the -n (--no-clobber) flag in the cp command will prevent the files from being overwritten so the files will only be copied once. However, if the files change a docpad clean would be needed. An rsync would be better.
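
To make the no-clobber behavior concrete, here is a small Node.js sketch of the same semantics for a single file. `copyNoClobber` is a hypothetical helper, not part of the plugin. It shows why a changed source file is not picked up until the old copy is removed.

```javascript
const fs = require('node:fs');

// Hypothetical helper mirroring `cp -n` for one file:
// returns true if the file was copied, false if `dest` already existed.
function copyNoClobber(src, dest) {
  try {
    // COPYFILE_EXCL makes the copy fail instead of overwriting `dest`
    fs.copyFileSync(src, dest, fs.constants.COPYFILE_EXCL);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // existing file is left untouched
    throw err;
  }
}
```

A second call after the source has changed returns false and leaves the stale copy in place, which is the case where a clean would be needed.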

Hypercubed May 10, 2013

Ok... I've got a repo. I couldn't figure out the tests. Maybe next week.

https://github.com/Hypercubed/docpad-plugin-raw

Hypercubed May 10, 2013

Well, a real test is going to be hard. The tester also runs out of memory when comparing test/out/ to test/out-expected/.

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

balupton May 10, 2013

Member

@Hypercubed haha, okay, I wonder if there is some native command we can use to compare the two directories... but perhaps it's time we start doing streams based comparisons.... happy to accept pull requests and discuss if anything needs to be done
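
One possible shape for a stream-based comparison, sketched in Node.js: hash each file incrementally and compare the digests, so neither file has to fit in memory. `hashFile` and `sameContent` are hypothetical names, and the choice of SHA-256 is an assumption made for the example. This is not the DocPad tester's implementation.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// Hypothetical helper: hash a file one chunk at a time
function hashFile(file) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(file)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// Hypothetical helper: two files are treated as identical when their digests match
async function sameContent(fileA, fileB) {
  const [hashA, hashB] = await Promise.all([hashFile(fileA), hashFile(fileB)]);
  return hashA === hashB;
}
```

Comparing test/out/ to test/out-expected/ would then be a matter of walking both trees and calling `sameContent` on each pair of files.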

Hypercubed May 11, 2013

Does anyone think it is a good/bad idea to use a hard link instead of a copy:

command = ['cp', '-Rnl', rawPath, config.outPath+'/']

It seems to work and is much faster.
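
One thing worth weighing with hard links: the linked file in out is the same file as the one in raw, not a snapshot. A small Node.js sketch of that behavior follows. `linkNoClobber` is a hypothetical helper, and it assumes both paths live on the same filesystem.

```javascript
const fs = require('node:fs');

// Hypothetical helper: hard-link `src` to `dest` instead of copying it,
// skipping destinations that already exist (the `-n` part of `cp -Rnl`).
function linkNoClobber(src, dest) {
  try {
    fs.linkSync(src, dest);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // keep the existing entry
    throw err;
  }
}
```

After linking, an in-place edit of the source is visible through the destination as well, which avoids stale copies but also means anything that modifies files in out would modify the originals.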

dimitarkolev May 11, 2013

Great idea, but what about Windows support? It's not my problem (no Windows here),
but I believe there are many people doing dev on Windows.

Hypercubed May 11, 2013

I'm developing using cygwin under windows and the cp -Rnl command works great. I think for dos we would need to use another command entirely.

dimitarkolev May 12, 2013

Cygwin is an option, but it will add another requirement for running DocPad under Windows. The solution might be to make it a configuration option in docpad.cson. This way I can use rsync or any other external application. Thinking about it, I am getting closer to the idea of having a simple plugin that does just that :)

balupton May 12, 2013

Member

Sure, so the raw plugin can be rewritten to use a node.js solution that uses streams for cross-platform compatibility; however, that will increase the time needed to do its task. The native executables for the operating system are always way faster.

How does rsync work?

Hypercubed May 13, 2013

Ok, I've pushed a version that uses xcopy /e instead of cp on Windows when Cygwin is not in the path. You can also change the command (e.g. to rsync) by setting plugins.raw.command in docpadConfig. This is all seeming more and more hackish by the day, but it is working great for my >1 GB files.
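
The platform fallback with a user override described above could be sketched roughly as follows in Node.js. `rawCommand` is a hypothetical function, the exact argument lists are assumptions, and the sketch keys only on the platform name rather than checking whether Cygwin is in the path. The plugin's actual code may differ.

```javascript
const path = require('node:path');

// Hypothetical helper: choose the copy command for the raw directory.
// `override` stands in for a user-configured command such as rsync.
function rawCommand(platform, srcPath, outPath, override) {
  if (override) return override;
  if (platform === 'win32') {
    // Assumed arguments: /e copies subdirectories, including empty ones
    return ['xcopy', '/e', path.win32.join(srcPath, 'raw'), outPath + '\\'];
  }
  // The trailing slash asks cp to copy the directory's contents
  return ['cp', '-Rn', path.posix.join(srcPath, 'raw') + '/', outPath + '/'];
}
```

Calling `rawCommand(process.platform, 'src', 'out')` would yield the platform default, while passing an array such as `['rsync', '-a', './src/raw/', './out/']` as the last argument returns that command unchanged.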

dimitarkolev May 13, 2013

Here is my rsync command:

rsync --exclude ".svn" -a ./src/raw/ ./out/

Using rsync solves many problems; I'll try to list some:

  1. Exclude source control folders; having them in out might lead to problems
  2. Very fast for thousands of small files that change from time to time, copying only those that have changed
  3. When working with large files, only the differences are copied (rsync has its ingenious way of doing it), which saves a lot of time
  4. Available everywhere (on Windows with Cygwin)

As a side note, we can add the --delete option, which will delete the files that are no longer in the raw folder, but that's not the default behavior of DocPad, so I don't use it.

Hypercubed May 15, 2013

@dimitarkolev Try the plugin with this in your config file:

plugins:
    raw:
        commands:
            raw: ['rsync', '-a', './src/raw/', './out/' ]

balupton Jun 25, 2013

Member

For what it's worth, here is a gist that rsyncs for deployment: https://gist.github.com/Hypercubed/5804999

balupton Jul 12, 2013

Member

@Hypercubed happy to make your raw plugin an official plugin on the docpad org. That means we'll test it with each new docpad version. Let me know if you're down for that.

Hypercubed Jul 15, 2013

@balupton I don't mind, although I'm still not completely satisfied with this solution because it relies on shell commands. I don't have anything better right now, though.

balupton Sep 27, 2013

Member

I guess the decision here is whether we should proceed with an executed-command approach or a node.js stream-pipe approach.

Member

balupton commented Nov 28, 2013

@balupton balupton closed this Nov 28, 2013

pflannery Nov 28, 2013

Contributor

There is also a plugin for running bash, sh, cmd or powershell scripts as documents or during specified docpad events.
