Setting up a crawler to list all MediaWiki wikis on the web #59
Comments
From seth.woo...@gmail.com on August 12, 2013 08:31:31
I checked out that code, and I didn't see django required anywhere. I think I was looking at the wrong thing? Would you mind forking that into a VCS somewhere online, or pointing me directly to the root and I'll put it up on github?
From nemow...@gmail.com on August 12, 2013 10:03:07
Thanks for looking. It's not my code or repo; I just followed http://graffiti.cs.brown.edu/svn/graffiti/README
From nemow...@gmail.com on October 25, 2013 05:45:18
Labels: -Type-Defect -Priority-Medium Type-Enhancement Priority-High
From nemow...@gmail.com on December 01, 2013 04:01:47
I made something with ruby mechanize: https://gist.github.com/nemobis/7718061 I've learnt a lot making it (euphemism for "crashed my head"), but I'm not sure it will be useful, because the search results actually returned by Google (or Yahoo) are far fewer than "promised".
From nemow...@gmail.com on January 31, 2014 07:27:12
Status: Started
Currently Google has greatly reduced the number of page results it returns. If you find any reliable search engine that doesn't cut off the results and returns relevant links, we can look into how to scrape it.
Emilio J. Rodríguez-Posada, 27/06/2014 10:06:
I've already worked around Google's limitations with the script above, I'm afraid. However, search engines don't help much, despite all the
Would the Internet Archive give us a list of MediaWiki sites from the Wayback Machine? Currently you can't search the archived sites, but I'm sure they can do it from their intranet with some interface. Ask Jason/Alex?
Emilio J. Rodríguez-Posada, 27/06/2014 10:50:
I don't think the Internet Archive has any reasonable infrastructure or services to crawl a few hundred thousand/million sites and search the
Obviously I don't know how they have built the Wayback Machine, but I'm sure they have a way to query URLs. There is a basic search engine at https://archive.org, but it's not useful for our task. I only want a list of archived links that end in "/Special:Version", or something else that helps us find MediaWiki wikis. Later we can do post-processing and exclude false positives, dead URLs, etc. I'm sure they have an index somewhere that can be filtered with 'grep' or whatever.
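To make that post-processing step concrete, here is a minimal sketch, assuming a plain-text dump of archived URLs (one per line). The file name, URL layout and normalization rules are illustrative assumptions, not part of any existing WikiTeam script.

```python
import re
import sys

# Hedged sketch: keep only URLs that look like MediaWiki Special:Version
# pages and normalize them to candidate wiki roots. The input format and
# normalization rules are assumptions for illustration.
SPECIAL_VERSION = re.compile(r'^(https?://\S+?)/Special:Version/?$', re.IGNORECASE)

def extract_roots(lines):
    roots = set()
    for line in lines:
        match = SPECIAL_VERSION.match(line.strip())
        if match:
            # Strip common entry points so "example.org/wiki/Special:Version"
            # and "example.org/index.php/Special:Version" map to the same root.
            root = re.sub(r'/(index\.php|wiki)$', '', match.group(1))
            roots.add(root.lower())
    return sorted(roots)

if __name__ == '__main__':
    # Usage: python special_version_filter.py < archived-urls.txt > candidates.txt
    for root in extract_roots(sys.stdin):
        print(root)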
Emilio J. Rodríguez-Posada, 27/06/2014 11:30:
*Special:Version$ is an option, true. The data in question would be
I have coded a basic spider for Referata results in Google, finding more than 100 new wikis: 7a6ef18
Run the script: python referata-spider.py > newlist
I will see if other wikifarms are supported with this approach.
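For readers following along, here is a rough sketch of the subdomain extraction such a spider needs once a results page has been fetched. This is not the actual referata-spider.py; the farm domain and file name are only examples.

```python
import re

# Hedged sketch, not the actual referata-spider.py: given the HTML of an
# already-fetched search results page, collect unique wikifarm subdomains.
def farm_subdomains(html, farm='referata.com'):
    pattern = re.compile(r'https?://([a-z0-9-]+\.' + re.escape(farm) + r')', re.IGNORECASE)
    return sorted({host.lower() for host in pattern.findall(html)})

# Example usage with a saved results page (file name is illustrative):
# with open('results.html') as f:
#     for host in farm_subdomains(f.read()):
#         print(host)
```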
Added spider for shoutwiki and updated the list 90c442a |
Added spider for wikkii and updated the list c420d4d
Isn't Special:Version blocked by robots.txt by default and only linked to from a small number of places on-wiki anyway, so generally not the best way to find MW sites? Ideas: Grab but don't yet visit all URLs, use some regex to normalize them, strip duplicates, strip sites you already have, then run checkalive against them to see if they are living MW sites? |
etesp, 27/06/2014 21:35:
No, usually wikis don't set any robot policy; and the most advert ones
And I already did all the rest that you suggest; see the gist above and run it yourself: https://gist.github.com/nemobis/7718061
Okay, seemed like the kind of thing you guys probably had covered :) |
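As background for the checkalive step discussed in this exchange, a minimal sketch of a MediaWiki liveness check follows. It is not WikiTeam's actual checkalive script; the probed paths and markers are assumptions.

```python
import sys
import time
import urllib.request

# Hedged sketch of a MediaWiki liveness check, NOT WikiTeam's checkalive.
# It probes a few common api.php locations and falls back to the generator
# meta tag; the paths and markers are assumptions.
CANDIDATE_API_PATHS = ['/api.php', '/w/api.php', '/wiki/api.php']

def fetch(url, timeout=15):
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'mw-check-sketch'})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except Exception:
        return ''

def is_mediawiki(root):
    """Return True if the site at `root` looks like a live MediaWiki install."""
    for path in CANDIDATE_API_PATHS:
        body = fetch(root.rstrip('/') + path + '?action=query&meta=siteinfo&format=json')
        if 'MediaWiki' in body and 'generator' in body:
            return True
    # Fall back to the generator meta tag on the main page
    return 'name="generator" content="MediaWiki' in fetch(root)

if __name__ == '__main__':
    # Usage: python mw_check_sketch.py < candidates.txt
    for line in sys.stdin:
        root = line.strip()
        if root:
            print(('alive\t' if is_mediawiki(root) else 'dead\t') + root)
            time.sleep(1)  # polite delay between sites
```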
Added spider for orain.org wikifarm and updated its list a3e6966 |
Added spider for neoseeker wikifarm and updated its list 636c6a9 |
wiki-site has some dynamic lists: http://www.wiki.co.il/active-wiki-en.html But each of them lists at most 3,000 wikis, while our list is over 5,000. How was it generated?
Emilio J. Rodríguez-Posada, 02/07/2014 16:34:
Our list came from mutante's wikistats IIRC, I don't know where his
There are 3,140 URLs in those two lists. Would you like me to run my
Scott
2014-07-03 7:01 GMT+02:00 Scott D. Boyd notifications@github.com:
After that, please run your script over the list.
2014-07-04 0:27 GMT+02:00 Emilio J. Rodríguez-Posada emijrp@gmail.com:
wiki-site.com list updated 767123e
Now it
I would add a delay to the checkalive because wiki-site is not very
Thanks
This is getting more difficult on Google, but maybe DuckDuckGo can be good enough. Given the limits to pagination, it needs to be done one TLD at a time. |
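To illustrate that TLD-by-TLD idea, here is a rough sketch against DuckDuckGo's HTML interface. The endpoint, query syntax and result extraction are assumptions that may change, and rate limits and terms of service need checking before any real run.

```python
import re
import time
import urllib.parse
import urllib.request

# Hedged sketch of a TLD-by-TLD query loop. Endpoint, query syntax and the
# crude link extraction are assumptions; the TLD list is illustrative.
TLDS = ['.de', '.fr', '.it', '.es', '.org']

def search_page(query):
    url = 'https://html.duckduckgo.com/html/?' + urllib.parse.urlencode({'q': query})
    req = urllib.request.Request(url, headers={'User-Agent': 'wikiteam-search-sketch'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode('utf-8', errors='replace')

def candidate_hosts(html):
    # Crude hostname extraction; false positives get filtered later by checkalive
    return {m.lower() for m in re.findall(r'https?://([a-z0-9.-]+)', html, re.IGNORECASE)}

if __name__ == '__main__':
    seen = set()
    for tld in TLDS:
        page = search_page('"Special:Version" site:%s' % tld)
        for host in candidate_hosts(page):
            if host.endswith(tld) and host not in seen:
                seen.add(host)
                print(host)
        time.sleep(10)  # generous delay between queries
```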
From nemow...@gmail.com on May 04, 2013 12:28:07
To grow our wikiteam collection of wikis, I have to increase our list of wikis. To archive our first 4500 wikis, we've used Andrew Pavlo's list. Now I want to adapt his crawling framework (see source linked at http://www.cs.brown.edu/~pavlo/mediawiki/ , and its README) to have a more up-to-date and complete list. I created my settings.py, used pip to install django 1.2, installed MySQL-python from my repositories, replaced httplib2 with httplib... and finally got stuck with MySQL errors. Unless someone else runs it for me, I need something simpler: most of the features in the original graffiti etc. are excessive, and in particular there's no reason why I should need a database. I'd like to modify/strip it and get a self-contained version just to make a list of domains running MediaWiki...
Original issue: http://code.google.com/p/wikiteam/issues/detail?id=59