pelican-import corrupts posts imported from WordPress #2255

Closed
afiskon opened this Issue Nov 29, 2017 · 11 comments

afiskon commented Nov 29, 2017

Hello,

I tried to move my blog ( https://eax.me/ ) from WordPress to Pelican using pelican-import. It works for the most part. There were a few difficulties, but I managed to find workarounds for most of them. In particular, pelican-import doesn't handle images well and doesn't support the CodeColorer plugin, so I had to use some regular expressions (see below).

Also I discovered that pelican-import doesn't work with pandoc 2.0.2 properly so I had to patch the script around line 729:

            parse_raw = ''
            cmd = ('pandoc {0} --from=html'
                   ' --to={1}+raw_html -o "{2}" "{3}"')
            cmd = cmd.format(parse_raw, out_markup,
                             out_filename, html_filename)

These are minor issues though. The most serious issues are the following.

  1. After migration, characters like ', " and $ were replaced with \', \" and \$. I can't just replace these sequences back with ', " and $ since these sequences are sometimes used in code snippets. There is a similar issue with the dash symbol, which after migration turns from --- into \-\--.

  2. Also in some code snippets code like #include <something.h> just turns into #include.

  3. Last but not least, one of the posts created by pelican-import (see below) hangs make devserver with 100% CPU usage.

You can reproduce all these issues using the XML file that I exported from WordPress. Please leave your email and I will send it to you.

Exact steps to reproduce:

perl -pi -e 's/\[cci.*?\]/<code>/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/\[\/cci.*?\]/<\/code>/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/\[cc_(\w+).*?\]/<pre><code>:::$1/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/\[cc.*?lang="(\w+)".*?\]/<pre><code>:::$1/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/\[cc.*?\]/<pre><code>/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/\[\/cc.*?\]/<\/code><\/pre>/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/<img .*?((src=".*?" alt=".*?")|(alt=".*?" src=".*?)).*?>/<img $1 \/>/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's/<span style="white\-space: nowrap;">(.*?)<\/span>/$1/g' ~/temp/del-me/wordpress.2017-11-29.xml
perl -pi -e 's!"https?:\/\/eax\.me\/([a-z0-9\-_]+)\/"!"https:\/\/eax.me\/$1.html"!g' ~/temp/del-me/wordpress.2017-11-29.xml
pelican-import -o content/ -m markdown --wpfile ~/temp/del-me/wordpress.2017-11-29.xml

Run make devserver to confirm that content/mojolicious.md hangs the server (issue 3). Temporarily move it somewhere else and run make devserver again.

Then see:

http://localhost:8000/diy-presentation-remote.html
http://localhost:8000/elliptic-curves-crypto.html

^ symbols $, ' and " turned into \$, \' and \" (issue 1).

http://localhost:8000/cpp-gtest.html

^ #include <something.h> was replaced with #include (issue 2).

Pelican version is 3.7.1.

waura commented Dec 23, 2017

I may have the same problem you had.
My environment is as follows.
pelican 3.7.1
pandoc 2.0.5

When I ran the following command, an error occurred.

$ pelican-import --wpfile -o ./output ./wp.xml
...
--normalize has been removed.  Normalization is now automatic.
--parse-raw/-R has been removed. Use +raw_html or +raw_tex extension.

Try pandoc --help for more information.
Please, check your Pandoc installation.

It seems that Pandoc's arguments have changed.
To avoid this error, you patched the script around line 729, didn't you?
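
For anyone hitting the same error: the two messages above say that `--normalize` and `--parse-raw` no longer exist in pandoc 2.x, so one option is to choose the command line based on the installed version. The following is a minimal Python sketch of that idea, not the code in pelican-import or in any merged fix. The helper names `parse_pandoc_version` and `build_pandoc_cmd` are hypothetical, the 1.x branch is reconstructed from the flags named in the error messages, and the 2.x branch simply mirrors the `+raw_html` patch posted earlier in this thread.

```python
import re


def parse_pandoc_version(version_output):
    """Extract a version tuple such as (2, 0, 5) from `pandoc --version` text."""
    first_line = version_output.splitlines()[0]
    match = re.search(r"(\d+(?:\.\d+)*)", first_line)
    if match is None:
        raise ValueError("no version number found in: %r" % first_line)
    return tuple(int(part) for part in match.group(1).split("."))


def build_pandoc_cmd(version, out_markup, out_filename, html_filename,
                     strip_raw=False):
    """Return a pandoc command string suited to the given version tuple.

    Hypothetical helper: the flag choices are assumptions taken from the
    error messages and the patch in this thread.
    """
    if version < (2,):
        # pandoc 1.x: the flags that 2.x reports as removed.
        parse_raw = "--parse-raw" if not strip_raw else ""
        cmd = 'pandoc --normalize {0} --from=html --to={1} -o "{2}" "{3}"'
        return cmd.format(parse_raw, out_markup, out_filename, html_filename)
    # pandoc 2.x: drop both removed flags and use the +raw_html extension,
    # as in the patch from the original report.
    raw_ext = "+raw_html" if not strip_raw else ""
    cmd = 'pandoc --from=html --to={0}{1} -o "{2}" "{3}"'
    return cmd.format(out_markup, raw_ext, out_filename, html_filename)


if __name__ == "__main__":
    version = parse_pandoc_version("pandoc 2.0.5\n")
    print(build_pandoc_cmd(version, "markdown", "post.md", "post.html"))
```

The version text would normally come from running `pandoc --version`; it is passed in as a string here so the sketch can be tried without pandoc installed.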

afiskon commented Dec 24, 2017

Right.

colmoneill commented Jan 11, 2018

Just ran into the very same error messages as @waura
pelican=3.7.1
Pandoc 2.0.6

$ pelican-import --wpfile -o output2/ osp-blog.wordpress.2017-10-10.xml 
output2/-.rst
--normalize has been removed.  Normalization is now automatic.
--parse-raw/-R has been removed. Use +raw_html or +raw_tex extension.

Try pandoc --help for more information.
Please, check your Pandoc installation.

colmoneill commented Jan 18, 2018

No news on this issue? It seems that Pandoc's arguments have simply changed. Any idea which version of pandoc we could downgrade to in order to get this working again?

Thanks!

colmoneill commented Jan 18, 2018

I was personally able to get my XML export to convert to Markdown, although both @afiskon and @waura seem to want .rst.

This is what I did to lines 728-732:

            parse_raw = '' if not strip_raw else ''
            cmd = ('pandoc {0} --from=html'
                   ' --to=gfm+raw_html -o "{2}" "{3}"')
            cmd = cmd.format(parse_raw, out_markup,
                             out_filename, html_filename)

This let me import and convert, and it works OK with the other pelican-import arguments such as --dir-cat and --dir-page.

Hope this helps in solving the issue for .rst conversions.
Cheers

davidwilemski added a commit to davidwilemski/pelican that referenced this issue Jan 29, 2018

mbbender commented Feb 2, 2018

I just hit this issue too.

Member

justinmayer commented Feb 8, 2018

@davidwilemski: It seems you might have implemented a workaround for this issue. Would you consider submitting a pull request so that others don't run into this problem?

davidwilemski added a commit to davidwilemski/pelican that referenced this issue Feb 11, 2018

davidwilemski commented Feb 11, 2018

@justinmayer, sure, I've opened up #2289 to get the ball rolling.

maschinetheist commented May 13, 2018

I hit this issue as well. The solution that @afiskon showed helped.

Contributor

davidag commented Aug 3, 2018

Let me recap the status of the three problems reported by @afiskon in this issue:

  1. Escaped characters: This was due to pandoc 2's default use of smart quotes on Markdown export. More info in the pandoc manual. Fixed in #2366.

  2. Disappearing source code: The problem is that the contents of <pre><code></code></pre> are not escaped. Because @afiskon is using the CodeColorer plugin which uses special tags, WordPress is not aware that the contents of [cc][/cc] must be escaped when exported to XML. I've tested that using the core WordPress editor, source code with special chars is correctly escaped and this issue doesn't reproduce. In my opinion, pelican-import should support only "valid" XML files.

  3. 100% CPU hang: This is similar to the previous one, but with different consequences. The root cause is an invalid HTML tag <a href="#i); inside <pre><code></code></pre> that leads pandoc to truncate part of the post content (again, because < was not correctly escaped in the XML). The resulting Markdown file produces the hang in Pelican, which does seem to be a bug. I conclude that pelican-import is not the problem here, so I propose that a new issue be opened to deal with the hang.

@justinmayer: Given the previous analysis, I think that this issue can be closed after pr #2366 is merged.
@afiskon Thank you for the report and for your help in providing the original file!
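
As an illustration of point 2, unescaped code inside <pre><code></code></pre> can be escaped in a pre-processing pass before the HTML reaches pandoc. The sketch below is only one possible workaround under stated assumptions, not something pelican-import does: the function name `escape_code_blocks` is hypothetical, it assumes the code bodies are entirely unescaped (running it on already-escaped content would escape it twice), and it only recognizes the exact `<pre><code>` and `</code></pre>` pairs produced by the substitutions in the original report.

```python
import html
import re

# Matches <pre><code> ... </code></pre> and captures the raw body.
# DOTALL lets the body span several lines; the lazy quantifier stops
# at the first closing pair.
CODE_BLOCK = re.compile(r"(<pre><code>)(.*?)(</code></pre>)", re.DOTALL)


def escape_code_blocks(post_html):
    """Escape &, < and > inside every <pre><code> block of post_html.

    Hypothetical helper: assumes the block bodies are not escaped yet.
    """
    def _escape(match):
        opening, body, closing = match.groups()
        # quote=False leaves ' and " alone; only &, < and > are rewritten.
        return opening + html.escape(body, quote=False) + closing

    return CODE_BLOCK.sub(_escape, post_html)


if __name__ == "__main__":
    sample = "<pre><code>:::cpp\n#include <something.h>\n</code></pre>"
    print(escape_code_blocks(sample))
```

With the header name escaped as `&lt;something.h&gt;`, an HTML parser no longer treats it as a tag, which is the behavior described above for content written in the core WordPress editor.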

@justinmayer justinmayer added this to the 3.8.0 milestone Aug 4, 2018

justinmayer added a commit that referenced this issue Aug 4, 2018

Merge pull request #2366 from davidag/pandoc2-import-support
Add pandoc2 support to pelican-import. Fix #2255

Member

justinmayer commented Aug 4, 2018

Associated fix by @davidag has been merged and will be included in Pelican 3.8 when released. Many thanks, David, for addressing this issue! 😁
