A crawler/scraper for the website of Marnet, the Macedonian DNS registrar
marnet
.gitignore
LICENSE
README.rst
requires.txt
scrapy.cfg

README.rst

Domeinot 2.0

Crawls and scrapes the website of Marnet. The MarNET Network Information Center (MARNET-NIC) is the registrar for the .mk domain.

Dependencies:

Scrapy, Twisted, and CouchDB/CouchDBKit

Description

The spider is defined in marnet/spiders/registar.py. This file describes what the spider crawls (which links it follows) and which pages it scrapes (see: MarnetSpider.rules).

XPath rules are used to scrape the needed info. The info is packed into marnet.items.MarnetItem objects and sent to the marnet.pipelines.MarnetPipeline pipeline, which stores it in a CouchDB database.
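The flow from XPath extraction to a CouchDB-ready document can be sketched with the standard library alone. This is only an illustration, not the project's code: the sample markup, the DomainItem class, the field names, and the choice of the domain name as the document _id are all assumptions, and the real MarnetItem and MarnetPipeline may look quite different.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict

# Hypothetical, well-formed stand-in for a domain detail page.
# The real Marnet markup is not shown in this README.
SAMPLE_PAGE = """
<html><body>
  <table id="domain">
    <tr><td>Domain</td><td>example.com.mk</td></tr>
    <tr><td>Registered</td><td>2001-01-01</td></tr>
  </table>
</body></html>
"""


@dataclass
class DomainItem:
    """Illustrative stand-in for marnet.items.MarnetItem (fields are assumed)."""
    domain: str
    registered: str


def scrape(page):
    """Extract label/value rows with an XPath-style query and pack them into an item."""
    root = ET.fromstring(page.strip())
    rows = root.findall(".//table[@id='domain']/tr")
    fields = {row[0].text: row[1].text for row in rows}
    return DomainItem(domain=fields["Domain"], registered=fields["Registered"])


def to_document(item):
    """Serialize an item as a JSON document; using the domain as _id is an assumption."""
    doc = asdict(item)
    doc["_id"] = item.domain
    return json.dumps(doc, sort_keys=True)


if __name__ == "__main__":
    print(to_document(scrape(SAMPLE_PAGE)))
```

A pipeline would then send each such document to the database configured in the settings. That step is left out here because it needs a running CouchDB.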

The spider begins at the page http://dns.marnet.net.mk/registar.php, then follows each http://dns.marnet.net.mk/registar.php?bukva=<smth> URL, and scrapes any http://dns.marnet.net.mk/registar.php?dom=domain.name.mk pages it finds.
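The three kinds of URL described above can be told apart by their query string. The following is a minimal stand-alone sketch of that decision in plain Python. It is not the actual MarnetSpider.rules definition, and the classify function and its return labels are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs

HOST = "dns.marnet.net.mk"
PATH = "/registar.php"


def classify(url):
    """Return 'start', 'follow', 'scrape', or None for URLs outside the crawl."""
    parts = urlparse(url)
    if parts.netloc != HOST or parts.path != PATH:
        return None
    query = parse_qs(parts.query)
    if "dom" in query:
        # Domain detail page: this is where the data is scraped.
        return "scrape"
    if "bukva" in query:
        # Per-letter index page: followed only to discover domain pages.
        return "follow"
    if not parts.query:
        # The bare registar.php page is the crawl's entry point.
        return "start"
    return None


if __name__ == "__main__":
    for url in (
        "http://dns.marnet.net.mk/registar.php",
        "http://dns.marnet.net.mk/registar.php?bukva=a",
        "http://dns.marnet.net.mk/registar.php?dom=example.com.mk",
    ):
        print(url, classify(url))
```

In the spider itself the same split is expressed declaratively through crawl rules, so no such function needs to exist in the project.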

Installation

git clone git://github.com/gdamjan/marnet-dns.git
cd marnet-dns
export PYTHONUSERBASE=$PWD/env
pip install --user -r requires.txt

Operation

Set the database URL (COUCHDB_URL) in marnet/settings.py, then run:

export PYTHONUSERBASE=$PWD/env
export PATH=$PYTHONUSERBASE/bin:$PATH
scrapy crawl marnet

The first time I started it, it ran for 30 minutes and created a 261MB ./cache/ folder, which suggests that's the amount of Internet traffic it generated. Since the Marnet site doesn't send ETags or timestamps, each run of the crawler will download everything again.
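Why missing validators force a full re-download can be shown with a small sketch of how a client builds a conditional request. It assumes that "timestamps" refers to the Last-Modified response header; the conditional_headers function is a hypothetical helper, not part of this project or of Scrapy.

```python
def conditional_headers(cached_response_headers):
    """Build revalidation headers from the headers of a previously cached response."""
    headers = {}
    etag = cached_response_headers.get("ETag")
    last_modified = cached_response_headers.get("Last-Modified")
    if etag:
        # The server can answer 304 Not Modified if the entity tag still matches.
        headers["If-None-Match"] = etag
    if last_modified:
        # Assumption: the "timestamps" mentioned above mean Last-Modified.
        headers["If-Modified-Since"] = last_modified
    return headers


if __name__ == "__main__":
    # A response without validators yields no conditional headers,
    # so the next request is an ordinary GET that transfers the whole page.
    print(conditional_headers({"Content-Type": "text/html"}))
```

With neither header available, the returned dictionary is empty and every page has to be fetched in full on each run.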

The CouchDB database has 16789 documents and takes up 41MB (using a very recent CouchDB version).