Skip to content
master
Go to file
Code
This branch is 453 commits behind ArchiveTeam:master.

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
bot
 
 
 
 
 
 
 
 
db
 
 
doc
 
 
lib
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README

1. ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"

2. More info

ArchiveBot has two major backend components: the control node, which
runs the IRC interface and bookkeeping programs, and the crawlers, which
do all the Web crawling.  ArchiveBot users communicate with ArchiveBot
by issuing commands in an IRC channel.

User's guide: http://archivebot.readthedocs.org/en/latest/
Control node installation guide: INSTALL.backend
Crawler installation guide: INSTALL.pipeline

3. License

Copyright 2013 David Yip; made available under the MIT license.  See
LICENSE for details.

4. Acknowledgments

Thanks to Alard (@alard), who added WARC generation and Lua scripting to
GNU Wget.  Wget+lua was the first web crawler used by ArchiveBot.

Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current web
crawler.

Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
tracking down performance problems at scale.

Other thanks go to the following projects:

* Celluloid <http://celluloid.io/>
* Cinch <https://github.com/cinchrb/cinch/>
* CouchDB <http://couchdb.apache.org/>
* Ember.js <http://emberjs.com/>
* Redis <http://redis.io/>
* Seesaw <https://github.com/ArchiveTeam/seesaw-kit>

5. Special thanks

Dragonette, Barnaby Bright, Vienna Teng, NONONO.

The memory hole of the Web has gone too far.
Don't look down, never look away; ArchiveBot's like the wind.

 vim:ts=2:sw=2:tw=72:et

About

ArchiveBot, an IRC bot for archiving websites

Resources

License

Releases

No releases published
You can’t perform that action at this time.