feature request: Import MediaWiki XML dump #70
Hi there! What you're asking for is quite complicated, since MediaWiki has a very rich language with many extensions. Furthermore, there are also revisions, images and attachments to consider. It could be done in a crappy way, i.e. just importing article slugs, titles and the text body with a simple conversion. The result would be that a manual rework would have to be done afterwards. And with regards to images... that would just be really complicated.
Hi :) While I realize that this is a complicated feature, it would be incredibly nice to have. Even if it only copied text and did not include attachments (such as images), just the part where you'd fetch all the pages and maintain the links between them would be very valuable. At least to me it would. I'm looking for a replacement for my company's MediaWiki, and your project seems like a great candidate. While I understand that this import feature is not part of your main concern with this project, I would certainly find it useful.
I'm currently working on this project: https://github.com/benjaoming/python-mwdump-tools It might be of interest to you, as it gives you a pretty simple XmlParser which you can extend.
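For large dumps, streaming the XML instead of loading it all at once keeps memory usage flat. A minimal sketch of that idea using only the standard library's `iterparse` (the sample XML and the `iter_pages` helper are illustrative assumptions, not taken from python-mwdump-tools):

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a real MediaWiki export file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>alpha body</text></revision></page>
  <page><title>Beta</title><revision><text>beta body</text></revision></page>
</mediawiki>"""


def local(tag):
    # Strip the "{namespace}" prefix ElementTree puts on tag names.
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield (title, text) pairs one page at a time instead of building the whole tree."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if local(elem.tag) == "page":
            title = text = None
            for child in elem.iter():
                if local(child.tag) == "title":
                    title = child.text
                elif local(child.tag) == "text":
                    text = child.text
            yield title, text
            elem.clear()  # free the finished subtree to keep memory bounded


for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, "->", text)
```

For a real dump you would pass an open file instead of the `BytesIO` sample; the `elem.clear()` call is what makes this scale to multi-gigabyte exports.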
Also check #275 :)
Hey friends, I wanted to bump this a bit since the last update on this issue was 10 years ago (!) I've got a MediaWiki wiki that I need to import into django-wiki. I don't care about revisions/history or images, I only want to insert the current pages of the MediaWiki wiki. Is there a way to do that? I've seen in the project's history that there was a management command that could be used, but this management command was removed (!) instead of fixed? Thank you
@spapas you can still try to use some of the code in #275 for your own project (it doesn't live in django-wiki currently because it lacked tests and probably broke). The quickest road to success is likely to make this work in your own project and do exactly the customizations that you need without worrying about universal use-cases.
Hey friends, using the code in #275 as a basis, I implemented a simple management command that imports from a MediaWiki XML dump and works with the latest django-wiki version and the latest MediaWiki version. It needs `lxml` to parse the MediaWiki XML dump and `unidecode` to convert non-latin characters to ASCII. It uses pandoc to do the actual mediawiki -> markdown conversion (I have tested it on Windows and it works great). Put the following in your management commands folder and run it with the path to the dump file as its argument:

```python
import datetime
import subprocess

import pytz
import unidecode
from django.contrib.auth import get_user_model
from django.contrib.sites.models import Site
from django.core.management.base import BaseCommand
from django.db import transaction
from django.template.defaultfilters import slugify
from lxml import etree

from wiki.models.article import Article, ArticleRevision
from wiki.models.urlpath import URLPath


def slugify2(s):
    # Transliterate non-latin characters to ASCII first,
    # otherwise slugify() would strip them entirely.
    return slugify(unidecode.unidecode(s))


def convert_to_markdown(text):
    # Shell out to pandoc for the mediawiki -> markdown (gfm) conversion.
    # subprocess.run with input= avoids the pipe deadlock that writing to
    # stdin and then reading stdout by hand can cause on large pages.
    proc = subprocess.run(
        ["pandoc", "-f", "mediawiki", "-t", "gfm"],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode("utf-8")


def create_article(title, text, timestamp, user):
    # Strip MediaWiki behaviour switches that pandoc would otherwise keep.
    text_ok = (
        text.replace("__NOEDITSECTION__", "")
        .replace("__NOTOC__", "")
        .replace("__TOC__", "")
    )
    text_ok = convert_to_markdown(text_ok)
    article = Article()
    article_revision = ArticleRevision()
    article_revision.content = text_ok
    article_revision.title = title
    article_revision.user = user
    article_revision.owner = user
    article_revision.created = timestamp
    article.add_revision(article_revision, save=True)
    article_revision.save()
    article.save()
    return article


def create_article_url(article, slug, current_site, url_root):
    upath = URLPath.objects.create(
        site=current_site, parent=url_root, slug=slug, article=article
    )
    article.add_object_relation(upath)


def import_page(current_site, url_root, text, title, timestamp, replace_existing, user):
    slug = slugify2(title)
    try:
        urlp = URLPath.objects.get(slug=slug)
        if not replace_existing:
            print("\tAlready existing, skipping...")
            return
        print("\tDestroying old version of the article")
        urlp.article.delete()
    except URLPath.DoesNotExist:
        pass
    article = create_article(title, text, timestamp, user)
    create_article_url(article, slug, current_site, url_root)


class Command(BaseCommand):
    help = (
        "Import everything from a MediaWiki XML dump file. "
        "Only the latest version of each page is imported."
    )

    def add_arguments(self, parser):
        parser.add_argument("file", type=str)

    @transaction.atomic
    def handle(self, *args, **options):
        user = get_user_model().objects.get(username="spapas")  # change this username
        current_site = Site.objects.get_current()
        url_root = URLPath.root()
        tree = etree.parse(options["file"])
        # The dump declares a versioned namespace, so match on local-name().
        pages = tree.xpath('//*[local-name()="page"]')
        for p in pages:
            title = p.xpath('*[local-name()="title"]')[0].text
            print(title)
            revision = p.xpath('*[local-name()="revision"]')[0]
            text = revision.xpath('*[local-name()="text"]')[-1].text
            timestamp = revision.xpath('*[local-name()="timestamp"]')[0].text
            timestamp = datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
            timestamp_with_timezone = pytz.utc.localize(timestamp)
            import_page(
                current_site,
                url_root,
                text,
                title,
                timestamp_with_timezone,
                True,
                user,
            )
```

Please notice that this tries to find a user with the username `spapas` to own the imported revisions; change that to an existing user on your own installation.
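The timestamp handling in the command above can also be done without `pytz` on Python 3, using the standard library's `datetime.timezone`. A small standalone illustration (the `parse_mw_timestamp` helper name is my own, not part of the command):

```python
from datetime import datetime, timezone


def parse_mw_timestamp(ts):
    """Parse a MediaWiki export timestamp like '2014-05-01T12:30:00Z'
    into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


dt = parse_mw_timestamp("2014-05-01T12:30:00Z")
print(dt.isoformat())  # -> 2014-05-01T12:30:00+00:00
```

Django's `USE_TZ = True` setting expects aware datetimes like this, which is why the command attaches UTC before saving the revision's `created` field.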
Thanks for sharing! I can only imagine that this is the perfect kind of boilerplate for someone to get started. Actually, it could fit very well in the documentation as a copy-paste example.
I'll try to add a PR on the docs for that! |
@spapas it would fit well next to the How-To about Disqus comments: https://django-wiki.readthedocs.io/en/main/tips/index.html (but the docs will be restructured, so don't worry too much about the location)
Do you have any plugin or extension to import an XML dump from a wiki managed by the MediaWiki engine?
It would be a nice feature to help people migrate to your engine.