
Update miraheze.org list without dead wikis #465
Closed
nemobis opened this issue Jun 16, 2023 · 18 comments

nemobis commented Jun 16, 2023

I've used the current version of miraheze-spider.py to update the list of wikis: 40a1f35

There were thousands of deletions and additions. Were there really so many wikis deleted and created?

We also need a stricter mode that iterates through all the results and removes those which respond with an HTTP 404, like https://crystalsmp.miraheze.org/.
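
A minimal sketch of such a stricter mode, assuming one base URL per line and an API endpoint at /w/api.php (the function name is illustrative, not existing miraheze-spider.py code):

import requests

def filter_alive(urls, timeout=30):
    """Yield only the URLs whose API endpoint does not answer with HTTP 404."""
    for url in urls:
        api = url.rstrip("/") + "/w/api.php"
        try:
            r = requests.get(api, params={"action": "query",
                                          "meta": "siteinfo",
                                          "format": "json"},
                             timeout=timeout)
        except requests.RequestException:
            continue  # unreachable hosts are dropped as well
        if r.status_code != 404:
            yield url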

nemobis commented Jun 16, 2023

The list also contains closed wikis which were made private (?):

Analysing https://trimirdi.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Error: could not get namespaces from the API request.
HTTP 200
{"error":{"code":"readapidenied","info":"You need read permission to use this module.","*":"See https://trimirdi.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"mw131"}

https://trimirdi.miraheze.org/wiki/Main_Page

This wiki has been automatically closed because there have been no edits or log actions made within the last 60 days. Since this wiki is private, it cannot be reopened by any user through the normal reopening request process. If this wiki is not reopened within 6 months, it may be deleted. Note: If you are a bureaucrat on this wiki, you can go to Special:ManageWiki and uncheck the "Closed" box to reopen it.
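
Such wikis answer HTTP 200 but return a readapidenied error, so they could be filtered out automatically before a dump is even attempted. A sketch (is_private is a hypothetical helper, keyed on the error structure shown above):

import requests

def is_private(api_url, timeout=30):
    """True if the wiki denies anonymous read access through the API."""
    r = requests.get(api_url, params={"action": "query",
                                      "meta": "siteinfo",
                                      "format": "json"},
                     timeout=timeout)
    try:
        data = r.json()
    except ValueError:  # not JSON at all; some other kind of breakage
        return False
    return data.get("error", {}).get("code") == "readapidenied"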

nemobis commented Jun 16, 2023

https://mario.miraheze.org redirects to https://mariopedia.org, which causes some confusion.

Titles saved at... mariomirahezeorg_w-20230616-titles.txt
15091 page titles loaded
https://mario.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
46 namespaces found          
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Did not get a valid JSON response from the server. Check that you used the correct hostname. If you did, the server might be wrongly configured or experiencing temporary problems.
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each

It would be better to have only the final domain in the list.
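
Resolving redirects before writing the list would avoid that. A sketch, assuming requests is available (some hosts answer HEAD poorly, in which case a GET would be needed):

from urllib.parse import urlparse

import requests

def final_domain(url, timeout=30):
    """Follow HTTP redirects and return the domain the wiki actually lives on."""
    r = requests.head(url, allow_redirects=True, timeout=timeout)
    return urlparse(r.url).netloc

# final_domain("https://mario.miraheze.org") should return "mariopedia.org"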

nemobis commented Jun 16, 2023

Some of the domain names don't even resolve:

Checking API... https://it.famepedia.org/w/api.php
Connection error: HTTPSConnectionPool(host='it.famepedia.org', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9fde82fb50>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Start retry attempt 2 in 20 seconds.
Checking API... https://it.famepedia.org/w/api.php
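
Hosts like this could be weeded out with a plain DNS lookup before any HTTP retries are spent on them. A standard-library sketch:

import socket

def resolves(hostname):
    """True if the hostname still has at least one DNS record."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:  # "Name or service not known" ends up here
        return False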

bkil commented Jun 16, 2023

According to this comment, they have lost 25% of their wikis due to a hard drive failure in November 2022:
https://news.ycombinator.com/item?id=36363433

bkil commented Jun 16, 2023

And indeed, they may have implemented automatic purging of wikis that had seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for wikis that were removed this way in the past.

nemobis commented Jun 16, 2023 via email

nemobis added a commit that referenced this issue Jun 17, 2023
nemobis commented Jun 17, 2023

I'm now running the venerable checkalive.pl with a 5-second sleep. Someone with more patience could start running it (or checkalive.py) with a higher sleep time, for example 10 seconds, so we'd have a better list within 24 hours or so.

I've updated the docs: c09db66
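
For reference, the core of such a check is just a polite loop. A simplified Python sketch (not the actual checkalive code; the sleep argument plays the role of the delay discussed above):

import time

import requests

def check_alive(urls, sleep=10):
    """Probe each wiki's API once, sleeping between requests to spare the servers."""
    alive = []
    for url in urls:
        try:
            r = requests.get(url.rstrip("/") + "/w/api.php",
                             params={"action": "query", "meta": "siteinfo",
                                     "format": "json"},
                             timeout=30)
            if r.ok and "query" in r.json():
                alive.append(url)
        except (requests.RequestException, ValueError):
            pass
        time.sleep(sleep)
    return alive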

nemobis commented Jun 18, 2023

The checkalive run is still ongoing, because some requests take over 10 seconds (which suggests some Miraheze servers are very overloaded at the moment). So far it has found over 2000 seemingly alive wikis.

At the moment we have wikiteam items for about 2240 distinct miraheze.org domains. (This search doesn't find Miraheze-hosted wikis outside the miraheze.org domain.)

ia search "collection:wikiteam originalurl:*miraheze*" -f originalurl | jq -r .originalurl | cut -f3 -d/ | sort -u > /tmp/2023-06-18_wikiteam_miraheze_originalurl.org

2023-06-18_wikiteam_miraheze_originalurl.org.gz

nemobis commented Jun 18, 2023

XML history dumps for about 1388 wikis are being uploaded. All the archives are also temporarily available at http://federico.kapsi.fi/tmp/mirahezeorg_202306_history.xml.7z.zip for those who need a faster download than the Internet Archive permits.

Help is appreciated with verifying that the dumps for each included wiki are complete and valid. The most comprehensive way to test a dump is to actually import it into a recent MediaWiki installation.
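
A cheaper first pass is to stream-parse each dump (after extracting the .7z) and check that the XML is well-formed all the way to the closing mediawiki tag; the resulting page and revision counts can then be compared with the wiki's own statistics. A sketch using only the standard library:

import xml.etree.ElementTree as ET

def validate_dump(path):
    """Count pages and revisions; iterparse raises ParseError if the file
    is truncated or malformed."""
    pages = revisions = 0
    for _, elem in ET.iterparse(path):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the MediaWiki export namespace
        if tag == "page":
            pages += 1
        elif tag == "revision":
            revisions += 1
        elem.clear()  # keep memory bounded on multi-gigabyte dumps
    return pages, revisions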

I also attach the logs from the dump. There are about 75k lines in the errors.log files, mostly about empty revisions. These could be legitimate deletions or some error on our side.

mirahezeorg_202306_logs.zip

nemobis commented Jun 18, 2023

One of the biggest wikis by XML size is now https://chakuwiki.miraheze.org, with over 30 GB (the dump hasn't finished yet), a significant increase from the roughly 300 MB in the chakuwikimirahezeorg_w-20220626-history.xml.7z dump previously uploaded by Kevin.

nemobis commented Jun 18, 2023

https://wiki.3805.co.uk/ fails with a certificate error, but the host only serves a Miraheze placeholder, so it looks like the wiki was deleted.

Connection error: HTTPSConnectionPool(host='wiki.3805.co.uk', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by SSLError(CertificateError("hostname 'wiki.3805.co.uk' doesn't match either of '*.miraheze.org', 'miraheze.org'",),))
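
This failure mode (a custom domain that now only answers with the *.miraheze.org wildcard certificate) could be classified as "wiki gone" rather than retried. A rough sketch that matches on the error text, which is brittle but mirrors the log above:

import requests

def gone_cert_mismatch(url):
    """True if the TLS certificate no longer covers this hostname."""
    try:
        requests.get(url, timeout=30)
        return False
    except requests.exceptions.SSLError as exc:
        return "doesn't match" in str(exc)
    except requests.RequestException:
        return False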

nemobis commented Jun 18, 2023

Another case of a private wiki, with "all rights reserved" in the footer. O_o
https://s.miraheze.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

RhinosF1 commented:

> And indeed, they may have implemented automatic purging of wikis that had seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for wikis that were removed this way in the past.

It is usual practice for SRE to run our backup script and upload to the Internet Archive before actual database drops.

They should all be on archive.org.

nemobis commented Jun 19, 2023

https://www.sekaipedia.org/wiki/Special:MediaStatistics is among the biggest wikis by media size, with 10 GB of FLAC files.

nemobis commented Jun 20, 2023

Finished! I found 6168 live wikis and 1536 dead or non-MediaWiki wikis.

nemobis added a commit that referenced this issue Jun 20, 2023
nemobis commented Jun 20, 2023

We're very close to the figure of 6400 wikis recently mentioned by Miraheze people, so the current list seems good enough to me.

nemobis closed this as completed Jun 20, 2023