
feature request: Import MediaWiki XML dump #70

Open
ghost opened this issue Oct 18, 2012 · 10 comments
Labels
enhancement A feature request: Will be implemented if someone steps up!

Comments

@ghost

ghost commented Oct 18, 2012

Do you have any plugin or extension to import an XML dump from a wiki managed by the MediaWiki engine?
It would be a nice feature to help migrate to your engine.

@benjaoming
Member

Hi there!

What you're asking for is quite complicated, since MediaWiki has a very rich language with many extensions. Furthermore, there are also revisions, images and attachments to consider.

It could be done in a crappy way, i.e. just importing article slugs, titles and the text body with a simple conversion.

The result would be that a manual rework would have to be done afterwards. And with regards to images... that would just be really complicated.

@eldamir

eldamir commented Nov 19, 2012

Hi :)

While I realize that this is a complicated feature, it would be incredibly nice to have. Even if it only copied text and did not include attachments (such as images), just fetching all the pages and maintaining the links between them would be very valuable. At least to me it would.

I'm looking for a replacement for my company's MediaWiki, and your project seems like a great candidate.

While I understand that this import feature is not part of your main concern with this project, I would certainly find it useful.

@benjaoming
Member

I'm currently working on this project:

https://github.com/benjaoming/python-mwdump-tools

It might be of interest to you, as it gives you a pretty simple XmlParser from which you can extend handle_page, and maybe get the conversion done with some pypandoc?
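For illustration, here's a rough sketch of that idea using only the standard library (this is not python-mwdump-tools' actual API; the function and element names are assumptions based on the MediaWiki dump schema). It streams the `<page>` elements out of a dump and yields their title and wikitext; the wikitext could then be handed to pypandoc (e.g. `pypandoc.convert_text(text, "gfm", format="mediawiki")`) for conversion:

```python
# Hypothetical sketch: stream <page> elements from a MediaWiki XML dump
# without loading the whole file, using the standard library.
import io
import xml.etree.ElementTree as ET


def iter_pages(source):
    """Yield (title, wikitext) for each <page> element in a MediaWiki dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Dump files are namespaced, so match on the local tag name only.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            title = elem.find(".//{*}title").text
            text = elem.find(".//{*}text").text
            yield title, text
            elem.clear()  # free memory, important for multi-GB dumps


sample = io.StringIO(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>Home</title><revision><text>'''Hello'''</text></revision></page>"
    "</mediawiki>"
)
for title, text in iter_pages(sample):
    print(title, text)  # Home '''Hello'''
```

Note that the `{*}` namespace wildcard requires Python 3.8+.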

@the-glu
Contributor

the-glu commented Jun 23, 2014

Also check #275 :)

@spapas
Contributor

spapas commented Feb 26, 2024

Hey friends, I wanted to bump this a bit since the last update on this issue was 10 years ago (!)

I've got a MediaWiki wiki that I need to import into django-wiki. I don't care about revisions/history or images; I only want to insert the current pages of the MediaWiki wiki. Is there a way to do that? I've seen in the project's history that there was a management command that could be used, but this management command was removed (!) instead of fixed?

Thank you

@benjaoming
Member

@spapas you can still try to use some of the code in #275 for your own project (it doesn't live in django-wiki currently because it lacked tests and probably broke).

The quickest road to success is likely to make this work in your own project and do exactly the customizations that you need without worrying about universal use-cases.

@spapas
Contributor

spapas commented Feb 29, 2024

Hey friends, using the code in #275 as a basis, I implemented a simple management command that imports from a MediaWiki XML dump and works with the latest django-wiki version and the latest MediaWiki version. It needs lxml to parse the MediaWiki XML dump and unidecode to convert non-Latin characters to ASCII. It uses pandoc to do the actual MediaWiki -> Markdown conversion (I have tested it on Windows and it works great).

Put the following in your management commands folder and run it like python manage.py import_mediawiki dump.xml

from django.core.management.base import BaseCommand
from wiki.models.article import ArticleRevision, Article
from wiki.models.urlpath import URLPath
from django.contrib.sites.models import Site
from django.template.defaultfilters import slugify
import unidecode
from django.contrib.auth import get_user_model
import datetime
import pytz
from django.db import transaction
import subprocess
from lxml import etree


def slugify2(s):
    return slugify(unidecode.unidecode(s))


def convert_to_markdown(text):
    # Shell out to pandoc; communicate() writes stdin and reads stdout
    # concurrently, avoiding the pipe-buffer deadlock that a manual
    # write/close/read sequence can hit on large pages.
    proc = subprocess.Popen(
        ["pandoc", "-f", "mediawiki", "-t", "gfm"],
        stdout=subprocess.PIPE,
        stdin=subprocess.PIPE,
    )
    stdout, _ = proc.communicate(text.encode("utf-8"))
    return stdout.decode("utf-8")


def create_article(title, text, timestamp, user):
    text_ok = (
        text.replace("__NOEDITSECTION__", "")
        .replace("__NOTOC__", "")
        .replace("__TOC__", "")
    )

    text_ok = convert_to_markdown(text_ok)

    article = Article()
    article_revision = ArticleRevision()
    article_revision.content = text_ok
    article_revision.title = title
    article_revision.user = user
    article_revision.owner = user
    article_revision.created = timestamp
    article.add_revision(article_revision, save=True)
    article_revision.save()
    article.save()
    return article


def create_article_url(article, slug, current_site, url_root):

    upath = URLPath.objects.create(
        site=current_site, parent=url_root, slug=slug, article=article
    )
    article.add_object_relation(upath)


def import_page(current_site, url_root, text, title, timestamp, replace_existing, user):
    slug = slugify2(title)

    try:
        urlp = URLPath.objects.get(slug=slug)

        if not replace_existing:
            print("\tAlready existing, skipping...")
            return

        print("\tDestroying old version of the article")
        urlp.article.delete()

    except URLPath.DoesNotExist:
        pass

    article = create_article(title, text, timestamp, user)
    create_article_url(article, slug, current_site, url_root)


class Command(BaseCommand):
    help = "Import everything from a MediaWiki XML dump file. Only the latest version of each page is imported."
    args = ""

    articles_worked_on = []
    articles_imported = []
    matching_old_link_new_link = {}

    def add_arguments(self, parser):
        parser.add_argument("file", type=str)

    @transaction.atomic()
    def handle(self, *args, **options):
        user = get_user_model().objects.get(username="spapas")
        current_site = Site.objects.get_current()
        url_root = URLPath.root()

        tree = etree.parse(options["file"])
        pages = tree.xpath('//*[local-name()="page"]')
        for p in pages:
            title = p.xpath('*[local-name()="title"]')[0].text
            print(title)
            revision = p.xpath('*[local-name()="revision"]')[0]
            text = revision.xpath('*[local-name()="text"]')[-1].text
            timestamp = revision.xpath('*[local-name()="timestamp"]')[0].text
            timestamp = datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
            timestamp_with_timezone = pytz.utc.localize(timestamp)

            import_page(
                current_site,
                url_root,
                text,
                title,
                timestamp_with_timezone,
                True,
                user,
            )

Please note that this tries to find a user named spapas to assign as the owner of the pages (you can leave that as None or add your own user). Also, I haven't tested whether it works fine when you've got multiple revisions of each page; it tries to pick the text of the latest one (text = revision.xpath('*[local-name()="text"]')[-1].text), but I'm not sure it will work properly. It's safer to include only the latest revision of each article in your MediaWiki dump. Also, you can pass True or False to import_page in order to replace or skip existing pages.
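For dumps that do contain multiple revisions per page, one way to be safe is to pick the revision with the greatest timestamp instead of relying on element order. A minimal sketch (standard-library ElementTree here for illustration; the management command above uses lxml, and the function name is hypothetical):

```python
# Hedged sketch: select the newest <revision> of a <page> by its <timestamp>.
import xml.etree.ElementTree as ET


def latest_revision_text(page):
    # ISO-8601 "Z" timestamps sort correctly as plain strings, so max()
    # over the timestamp text picks the most recent revision.
    revisions = page.findall("{*}revision")
    latest = max(revisions, key=lambda r: r.find("{*}timestamp").text)
    return latest.find("{*}text").text


page = ET.fromstring(
    '<page xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<title>Home</title>"
    "<revision><timestamp>2023-01-01T00:00:00Z</timestamp><text>old</text></revision>"
    "<revision><timestamp>2024-02-02T00:00:00Z</timestamp><text>new</text></revision>"
    "</page>"
)
print(latest_revision_text(page))  # new
```

In the lxml-based command, the equivalent change would be to `max()` over `revision.xpath('*[local-name()="timestamp"]')[0].text` rather than indexing with `[-1]`.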

@benjaoming
Member

Thanks for sharing! I can only imagine that this is the perfect kind of boilerplate for someone to get started. Actually, it could fit very well in the documentation as a copy-paste example.

@spapas
Contributor

spapas commented Feb 29, 2024

I'll try to add a PR on the docs for that!

@benjaoming
Member

@spapas it would fit well next to the How-To about Disqus comments: https://django-wiki.readthedocs.io/en/main/tips/index.html

(but the docs will be restructured, so don't worry too much about the location)
