Update miraheze.org list without dead wikis #465
Comments
The list also contains closed wikis which were made private (?):
https://trimirdi.miraheze.org/wiki/Main_Page
|
https://mario.miraheze.org redirects to https://mariopedia.org, which causes some confusion.
It would be better to have only the final domain in the list. |
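A minimal sketch of how the list could be normalized to final domains, assuming plain HTTP redirects (JavaScript or meta-refresh redirects would not be followed). The function names `final_url` and `canonical_base` are hypothetical and not part of the existing scripts:

```python
import urllib.request
from urllib.parse import urlsplit


def final_url(url, timeout=30):
    """Fetch url, following HTTP redirects, and return the URL finally served."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def canonical_base(url, resolve=final_url):
    """Return scheme://host of wherever url ends up after redirects.

    `resolve` is injectable so the logic can be tested without network access.
    """
    parts = urlsplit(resolve(url))
    return f"{parts.scheme}://{parts.netloc}"
```

Running `canonical_base` over every entry and de-duplicating the results would leave one entry per wiki, at the cost of one request per domain.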
Some of the domain names don't even resolve
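For reference, a rough sketch of how non-resolving names could be separated from the rest before any HTTP check. It only tests DNS resolution, so a squatted or parked domain would still count as resolving. `resolves` and `split_by_dns` are hypothetical helper names:

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """Return True if hostname has at least one DNS result, False otherwise."""
    try:
        lookup(hostname, 443)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name does not resolve; UnicodeError: malformed hostname.
        return False
    return True


def split_by_dns(hostnames, lookup=socket.getaddrinfo):
    """Partition hostnames into (resolving, not_resolving), preserving order."""
    resolving, not_resolving = [], []
    for hostname in hostnames:
        target = resolving if resolves(hostname, lookup) else not_resolving
        target.append(hostname)
    return resolving, not_resolving
```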
|
According to this comment, they have lost 25% of their wikis due to a hard drive failure in November 2022: |
And indeed, they may have implemented automatic purging of a wiki if it had seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for wikis that were removed this way in the past. |
@nemobis That domain was probably squatted recently: |
On 17/06/23 00:00, Thomas Nagy wrote:
And indeed, they may have implemented automatic purging of a wiki if it had seen no recent activity,
Please see
https://wiki.archiveteam.org/index.php/Miraheze
We tried to coordinate with Miraheze so that we'd have dumps of most of
their wikis before any major deletions, but we don't have an exact
timeline of all past wikis created and deleted and we'll probably never
have one.
Now we need to focus on the wikis which are still online. Later, if/when
Miraheze goes down completely, we can look for any hidden archives for
missing wikis.
|
I'm now running the venerable checkalive.pl with a 5-second sleep. Someone with more patience could start running it (or checkalive.py) with a longer sleep time, for example 10 seconds, so we'd have a better list within 24 hours or so. I've updated the docs: c09db66 |
The checkalive run is still ongoing, because some requests take over 10 seconds (which suggests some Miraheze servers are very overloaded at the moment). So far it has found over 2000 seemingly alive wikis. At the moment we have wikiteam items for about 2240 distinct miraheze domains. (This search doesn't find Miraheze-hosted wikis outside the miraheze.org domain.)
|
XML history dumps for about 1388 wikis are being uploaded. All the archives are also available at http://federico.kapsi.fi/tmp/mirahezeorg_202306_history.xml.7z.zip temporarily for those who need a faster download than IA permits. Help is appreciated with verifying that the dumps for each included wiki are complete and valid. The most comprehensive way to test a dump is to actually test importing it into a recent MediaWiki installation. I also attach the logs from the dump. There are about 75k lines in the errors.log files, mostly about empty revisions. These could be legitimate deletions or some error on our side. |
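As a cheaper first pass before a full import test, something like the following could flag truncated or malformed dumps. This is only a sketch: it assumes the XML has already been decompressed into a binary stream, it checks well-formedness rather than schema validity, and "empty revision" here simply means a `<revision>` whose `<text>` is missing or blank, which may not match what errors.log reports. `summarize_dump` is a hypothetical name:

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_dump(stream):
    """Stream-parse a MediaWiki XML dump and count pages and revisions."""
    result = {"ok": True, "error": None, "pages": 0, "revisions": 0, "empty_revisions": 0}
    try:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "revision":
                result["revisions"] += 1
                text = next((c for c in elem if local_name(c.tag) == "text"), None)
                if text is None or not (text.text or "").strip():
                    result["empty_revisions"] += 1
                elem.clear()  # free memory; history dumps can be very large
            elif name == "page":
                result["pages"] += 1
                elem.clear()
    except ET.ParseError as error:
        # A truncated or corrupted file ends up here.
        result["ok"] = False
        result["error"] = str(error)
    return result
```

A dump that parses cleanly here can still fail on import, so this only narrows down which archives need a closer look.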
One of the biggest wikis by XML size is now https://chakuwiki.miraheze.org, at over 30 GB (and the dump hasn't finished yet), a significant increase from the roughly 300 MB in the chakuwikimirahezeorg_w-20220626-history.xml.7z dump previously uploaded by Kevin. |
https://wiki.3805.co.uk/ fails with a certificate error, but the host only serves a Miraheze placeholder, so it looks like the wiki was deleted.
|
Another case of a private wiki, with "all rights reserved" in the footer. O_o |
It is usual practice for SRE to run our backup script and upload to Internet Archive before actual database drops. They should all be on archive.org |
https://www.sekaipedia.org/wiki/Special:MediaStatistics is among the biggest by image size, with 10 GB of FLAC files. |
|
We're very close to the figure of 6400 wikis recently mentioned by Miraheze people, so the current list seems good enough to me. |
I've used the current version of miraheze-spider.py to update the list of wikis: 40a1f35
There were thousands of deletions and additions. Were there really so many wikis deleted and created?
We also need a stricter mode which would iterate through all the results and remove those which respond with an HTTP 404, like https://crystalsmp.miraheze.org/ .
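A rough sketch of what such a stricter mode might look like, written as a standalone filter rather than as a patch to miraheze-spider.py. It drops only URLs that answer with HTTP 404 and deliberately keeps those that time out or fail to connect, on the assumption that an overloaded server is not the same as a deleted wiki. The names `http_status` and `drop_not_found` and the User-Agent string are hypothetical:

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=30):
    """Return the HTTP status code for url, or None if no response was received."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "wiki-list-check/0.1 (hypothetical)"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, timeout, TLS error, ...


def drop_not_found(urls, get_status=http_status, sleep_seconds=0):
    """Return urls minus those that respond with HTTP 404, preserving order."""
    kept = []
    for url in urls:
        if get_status(url) != 404:
            kept.append(url)
        if sleep_seconds:
            time.sleep(sleep_seconds)  # be gentle with overloaded servers
    return kept
```

Passing a non-zero `sleep_seconds` keeps the request rate low, at the cost of a run that takes hours for a list of several thousand wikis.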