
Setting up a crawler to list all MediaWiki wikis in the web #59

Open
emijrp opened this issue Jun 25, 2014 · 26 comments

Comments

@emijrp
Member

emijrp commented Jun 25, 2014

From nemow...@gmail.com on May 04, 2013 12:28:07

To grow our wikiteam collection of wikis, I have to increase our list of wikis. To archive our first 4500 wikis, we've used Andrew Pavlo's list. Now I want to adapt his crawling framework (see source linked at http://www.cs.brown.edu/~pavlo/mediawiki/ , and its README) to have a more up-to-date and complete list. I created my settings.py, used pip to install django 1.2, installed MySQL-python from my repositories, replaced httplib2 with httplib... and finally got stuck with MySQL errors. Unless someone else runs it for me, I need something simpler: most of the features in the original graffiti framework are excessive, and in particular there's no reason why I should need a database. I'd like to modify/strip it and get a self-contained version just to make a list of domains running MediaWiki...

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=59
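
For reference, a minimal self-contained sketch of such a check (not Pavlo's framework; Python with the requests library; the api.php fallback assumes the API sits at the domain root, which it often does not) could look like this:

# Hypothetical sketch: decide whether a candidate URL runs MediaWiki by
# looking for the "generator" meta tag MediaWiki emits on every page,
# falling back to the api.php siteinfo query.
import re
import sys
import requests

GENERATOR_RE = re.compile(r'<meta name="generator" content="MediaWiki', re.I)

def is_mediawiki(url):
    try:
        if GENERATOR_RE.search(requests.get(url, timeout=15).text):
            return True
        # Fallback: the API entry point answers with the MediaWiki version.
        api = requests.get(url.rstrip('/') + '/api.php',
                           params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
                           timeout=15)
        return 'MediaWiki' in api.json().get('query', {}).get('general', {}).get('generator', '')
    except Exception:
        return False

if __name__ == '__main__':
    for line in sys.stdin:          # one candidate URL per line
        candidate = line.strip()
        if candidate and is_mediawiki(candidate):
            print(candidate)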

@emijrp
Member Author

emijrp commented Jun 25, 2014

From seth.woo...@gmail.com on August 12, 2013 08:31:31

I checked out that code, and I didn't see django required anywhere. Perhaps I was looking at the wrong thing? Would you mind forking it into a VCS somewhere online, or pointing me directly to the root and I'll put it up on GitHub?

@emijrp
Member Author

emijrp commented Jun 25, 2014

From nemow...@gmail.com on August 12, 2013 10:03:07

Thanks for looking. It's not my code or repo; I just followed http://graffiti.cs.brown.edu/svn/graffiti/README

@emijrp
Member Author

emijrp commented Jun 25, 2014

From nemow...@gmail.com on October 25, 2013 05:45:18

Labels: -Type-Defect -Priority-Medium Type-Enhancement Priority-High

@emijrp
Member Author

emijrp commented Jun 25, 2014

From nemow...@gmail.com on December 01, 2013 04:01:47

I made something with Ruby Mechanize: https://gist.github.com/nemobis/7718061 I've learnt a lot making it (euphemism for "banged my head against it"), but I'm not sure it will be useful, because the search results actually returned by Google (or Yahoo) are far fewer than "promised".
For instance, searching for "Magnus Manske, Brion Vibber, Lee Daniel Crocker" should be a rather reliable way to find only one page per site (Special:Version) and gives an estimate of 25k results, but actually returns around a hundred.

@emijrp
Member Author

emijrp commented Jun 25, 2014

From nemow...@gmail.com on January 31, 2014 07:27:12

Status: Started

@emijrp
Member Author

emijrp commented Jun 27, 2014

Google has greatly reduced the number of page results it returns. If you find any reliable search engine that doesn't cut the results and returns relevant links, we can research how to scrape it.

@nemobis
Member

nemobis commented Jun 27, 2014

Emilio J. Rodríguez-Posada, 27/06/2014 10:06:

Google has greatly reduced the number of page results it
returns. If you find any reliable search engine that doesn't cut the
results and returns relevant links, we can research how to scrape it.

I've already worked around Google's limitations with the script above
and found several thousand wikis, but not that many. You can run it
from Spain; it will give slightly different results and add some URLs we
don't know.

I'm afraid, however, that search engines don't help much, despite all the
effort I put into them; so I'm currently working on using services like
builtwith.com and 80plus.

@emijrp
Member Author

emijrp commented Jun 27, 2014

Would the Internet Archive give us a list of MediaWiki sites from the Wayback Machine? Currently you can't search the archived sites, but I'm sure they can do it internally with some interface.

Ask Jason/Alex?

@nemobis
Member

nemobis commented Jun 27, 2014

Emilio J. Rodríguez-Posada, 27/06/2014 10:50:

Would the Internet Archive give us a list of MediaWiki sites from the
Wayback Machine? Currently you can't search the archived sites, but I'm
sure they can do it internally with some interface.

Ask Jason/Alex?

I don't think the Internet Archive has any reasonable infrastructure to
search even just the URLs of their archived pages; @ab2525 would know.
Asking for IA's help would mean asking them for several person-hours of work
at best, probably person-days, or transferring petabytes of data to
analyse. I can imagine it being rather expensive.

I found that services to crawl a few hundred thousand/million sites and
search the respective pages cost a few hundred dollars: do you think
they're not worth it?

@emijrp
Member Author

emijrp commented Jun 27, 2014

Obviously I don't know how they have built the Wayback Machine, but I'm sure they have a way to query URLs. There is a basic search engine at https://archive.org, but it's not useful for our task.

I only want a list of archived links that end in "/Special:Version", or something else that helps us find MediaWiki wikis. Later we can do post-processing and exclude false positives, dead URLs, etc. I'm sure they have an index somewhere that can be filtered with 'grep' or whatever.

@nemobis
Member

nemobis commented Jun 27, 2014

Emilio J. Rodríguez-Posada, 27/06/2014 11:30:

I only want a list of archived links that end in "/Special:Version"

*Special:Version$ is an option, true. The data in question would be
https://archive.org/details/waybackcdx
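
For reference, a hedged sketch of querying that CDX index for one candidate domain (the public CDX API cannot do a global "every URL ending in Special:Version" search, so a candidate domain is still needed first; the filter regex syntax below should be double-checked against the CDX server documentation):

# Sketch: list archived Special:Version captures for one domain via the
# Wayback Machine CDX API; parameters are the documented CDX ones, but
# treat the exact filter syntax as an assumption.
import requests

def wayback_special_version(domain):
    params = {
        'url': domain,
        'matchType': 'domain',                     # include subdomains
        'filter': 'original:.*Special:Version.*',  # keep only Special:Version captures
        'collapse': 'urlkey',                      # one row per distinct URL
        'fl': 'original',                          # return only the original URL
        'output': 'json',
        'limit': '500',
    }
    resp = requests.get('http://web.archive.org/cdx/search/cdx', params=params, timeout=60)
    rows = resp.json() if resp.text.strip() else []
    return [row[0] for row in rows[1:]]            # first row is the field header

print(wayback_special_version('referata.com'))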

@emijrp
Member Author

emijrp commented Jun 27, 2014

I have coded a basic spider for Referata results in Google, finding more than 100 new wikis: 7a6ef18

Run the script: python referata-spider.py > newlist
cut -d'/' -f1-3 newlist | sort | uniq > newlist2
cat newlist2 referata.com | sort | uniq > newlist3
diff referata.com newlist3
mv newlist3 referata.com

I will see if this approach works for other wikifarms too.
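
For other wikifarms the post-processing is the same as the cut/sort/uniq pipeline above; a small hypothetical Python equivalent (not part of the repository) that trims spider output to scheme://host and merges it with the existing list:

# Hypothetical helper mirroring the shell pipeline above: trim URLs to
# scheme://host (like cut -d'/' -f1-3), deduplicate, and merge with the
# list we already keep for the wikifarm.
import sys
from urllib.parse import urlsplit

def merge_lists(spider_output, existing_list):
    urls = set()
    for path in (spider_output, existing_list):
        with open(path) as f:
            for line in f:
                parts = urlsplit(line.strip())
                if parts.scheme and parts.netloc:
                    urls.add('%s://%s' % (parts.scheme, parts.netloc))
    return sorted(urls)

if __name__ == '__main__':
    # usage: python merge_lists.py newlist referata.com > newlist3
    for url in merge_lists(sys.argv[1], sys.argv[2]):
        print(url)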

@nemobis
Member

nemobis commented Jun 27, 2014

Emilio J. Rodríguez-Posada, 27/06/2014 15:03:

I have coded a basic spider for Referata results in Google, finding
more than 100 new wikis: 7a6ef18

Nice. As you can see in my gist, I forgot to try referata queries.

@emijrp
Member Author

emijrp commented Jun 27, 2014

Added spider for shoutwiki and updated the list 90c442a

@emijrp
Member Author

emijrp commented Jun 27, 2014

Added spider for wikkii and updated the list c420d4d

@plexish
Contributor

plexish commented Jun 27, 2014

Isn't Special:Version blocked by robots.txt by default and only linked to from a small number of places on-wiki anyway, so generally not the best way to find MW sites?

Ideas:
Search for site:*/Special:Recentchanges, which is linked from every page on a wiki and so is much more likely to be caught by search engines, though it is still blocked by robots.txt.
Search for inurl:/Main_Page and its default translation in a bunch of languages; a large majority of wikis seem to keep the main page. (inurl apparently returns ~double the results of a normal site: search?)
Search for inurl:/wiki/ mediawiki

Grab but don't yet visit all URLs, use some regex to normalize them, strip duplicates, strip sites you already have, then run checkalive against them to see if they are living MW sites?
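
A sketch of that normalize/dedup step (illustrative names, not existing code; the final liveness test would then be a separate checkalive-style pass):

# Illustrative sketch of "grab, normalize, strip duplicates and known sites".
import re
from urllib.parse import urlsplit

# Strip common MediaWiki page suffixes so every page of a site collapses
# to one base URL.
PAGE_RE = re.compile(r'/(wiki/|index\.php).*$', re.I)

def normalize(url):
    url = PAGE_RE.sub('/', url)
    parts = urlsplit(url)
    return '%s://%s' % (parts.scheme or 'http', parts.netloc.lower())

def new_candidates(search_hits, known_sites):
    known = {normalize(u) for u in known_sites}
    return sorted({normalize(u) for u in search_hits} - known)

hits = ['http://Example.org/wiki/Main_Page',
        'http://example.org/index.php?title=Special:RecentChanges',
        'http://other.net/wiki/Special:Version']
print(new_candidates(hits, ['http://example.org/']))   # -> ['http://other.net']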

@nemobis
Member

nemobis commented Jun 27, 2014

etesp, 27/06/2014 21:35:

Isn't Special:Version blocked by robots.txt by default and only linked
to from a small number of places on-wiki anyway, so generally not the
best way to find MW sites?

No, usually wikis don't set any robots policy; and the most advanced ones
use short URLs and block only the non-short variants, e.g.
w/index.php?title=Special:Version but not wiki/Special:Version.

And I already did all the rest of what you suggest; see the gist above and run it yourself: https://gist.github.com/nemobis/7718061
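
For what it's worth, a quick standard-library check of that point (example.org is a placeholder): whether a given wiki's robots.txt blocks Special:Version in either URL form.

# Check whether Special:Version is disallowed by a wiki's robots.txt,
# for both the short and the non-short URL variant.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('http://example.org/robots.txt')
rp.read()
for path in ('/wiki/Special:Version', '/w/index.php?title=Special:Version'):
    ok = rp.can_fetch('*', 'http://example.org' + path)
    print(path, 'allowed' if ok else 'blocked')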

@plexish
Contributor

plexish commented Jun 28, 2014

Okay, seemed like the kind of thing you guys probably had covered :)

@emijrp
Member Author

emijrp commented Jul 2, 2014

Added spider for orain.org wikifarm and updated its list a3e6966

@emijrp
Member Author

emijrp commented Jul 2, 2014

Added spider for neoseeker wikifarm and updated its list 636c6a9

@emijrp
Member Author

emijrp commented Jul 2, 2014

wiki-site has some dynamic lists:

http://www.wiki.co.il/active-wiki-en.html
http://www.wiki.co.il/active-wiki-all.html

But each of them has at most 3,000 wikis. Our list is over 5,000. How was it generated?

@nemobis
Member

nemobis commented Jul 2, 2014

Emilio J. Rodríguez-Posada, 02/07/2014 16:34:

But each of them has at most 3,000 wikis. Our list is over 5,000. How was
it generated?

Our list came from mutante's wikistats, IIRC; I don't know where his
comes from (it's probably years old).

@scottdb

scottdb commented Jul 3, 2014

There are 3,140 URLs in those two lists. Would you like me to run my
checkalive.pl script on the lists?

Scott


@emijrp
Member Author

emijrp commented Jul 3, 2014

2014-07-03 7:01 GMT+02:00 Scott D. Boyd notifications@github.com:

There are 3,140 URLs in those two lists. Would you like me to run my
checkalive.pl script on the lists?

Tomorrow I will merge those two lists with our wiki-site list (5,000) and
remove dupes.

After that, please run your script over the list.


@emijrp
Member Author

emijrp commented Jul 4, 2014

wiki-site.com list updated: 767123e. Now it
has 5,839 unique wikis.

I would add a delay to the checkalive run, because wiki-site is not very
reliable. Also, checking some wikis by hand, it seems that some of their
servers are down.

Thanks


@nemobis
Member

nemobis commented Feb 7, 2020

This is getting more difficult on Google, but maybe DuckDuckGo can be good enough. Given the limits to pagination, it needs to be done one TLD at a time.
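
A possible sketch of the one-TLD-at-a-time loop, assuming DuckDuckGo's HTML-only endpoint (https://html.duckduckgo.com/html/) accepts a q parameter and that the site: operator takes a bare TLD; the parsing is deliberately crude and would need tuning against the real markup:

# Hedged sketch: query one TLD at a time and collect candidate hosts.
import re
import time
import requests
from urllib.parse import urlsplit, unquote

URL_RE = re.compile(r'https?://[^\s"<>&]+')

def hosts_for_tld(tld, query='"Powered by MediaWiki"'):
    html = requests.get('https://html.duckduckgo.com/html/',
                        params={'q': '%s site:%s' % (query, tld)},
                        headers={'User-Agent': 'wikiteam-crawler-sketch'},
                        timeout=30).text
    hosts = set()
    for raw in URL_RE.findall(unquote(html)):      # decode %-escaped redirect targets first
        host = urlsplit(raw).netloc.lower()
        if host.endswith(tld) and 'duckduckgo.com' not in host:
            hosts.add(host)
    return sorted(hosts)

for tld in ('.de', '.fi', '.it'):
    print(tld, hosts_for_tld(tld))
    time.sleep(10)                                 # be gentle between queries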

@nemobis nemobis added this to the 1.0 milestone Feb 10, 2020