Domeinot 2.0

Crawls and scrapes the website of Marnet. The MarNET Network Information Center (MARNET-NIC) is the registrar for the .mk domain.

Dependencies:

Scrapy, Twisted, and CouchDB/CouchDBKit

Description

The spider is defined in marnet/spiders/registar.py. This file defines which links the spider follows and which pages it scrapes (see MarnetSpider.rules).

XPath rules are used to scrape the needed info. The info is packed into marnet.items.MarnetItem objects and sent to the marnet.pipelines.MarnetPipeline pipeline, which stores it in a CouchDB database.
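
The hand-off from a scraped item to the database could be sketched roughly as follows. This is not the project's pipeline (which goes through CouchDBKit); it is a standard-library illustration that relies only on CouchDB's HTTP API, where a PUT to /{db}/{docid} with a JSON body creates a document. The item fields (domain, registrant), the database URL, and the choice of the domain name as document ID are all assumptions made for the example.

```python
import json
import urllib.request
from urllib.parse import quote


def build_store_request(couchdb_url, item):
    """Build (but do not send) a PUT request that stores `item` in CouchDB.

    Assumptions: `couchdb_url` points at a database, and the item has a
    'domain' key that is used here as the document ID.
    """
    doc_id = quote(item["domain"], safe="")
    url = couchdb_url.rstrip("/") + "/" + doc_id
    body = json.dumps(item).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical scraped item; the real MarnetItem fields may differ.
item = {"domain": "example.mk", "registrant": "Example Ltd."}
request = build_store_request("http://127.0.0.1:5984/marnet", item)
# urllib.request.urlopen(request) would send it to a running CouchDB.
```

Building the request separately from sending it keeps the sketch usable without a running CouchDB instance.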

The spider begins at the page http://dns.marnet.net.mk/registar.php, then follows each http://dns.marnet.net.mk/registar.php?bukva=<smth> URL, and scrapes any http://dns.marnet.net.mk/registar.php?dom=domain.name.mk pages it finds.
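
The crawl order described above amounts to sorting URLs into three groups. The real spider expresses this through MarnetSpider.rules; the snippet below is only a standard-library sketch of the same decision, and the function name classify and its return labels are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

REGISTRY_HOST = "dns.marnet.net.mk"
REGISTRY_PATH = "/registar.php"


def classify(url):
    """Return 'start', 'follow', 'scrape', or None for a URL.

    'follow' pages (?bukva=...) are only crawled for links;
    'scrape' pages (?dom=...) hold the data for a single domain.
    """
    parts = urlparse(url)
    if parts.netloc != REGISTRY_HOST or parts.path != REGISTRY_PATH:
        return None  # outside the registry listing, ignored
    query = parse_qs(parts.query, keep_blank_values=True)
    if "dom" in query:
        return "scrape"
    if "bukva" in query:
        return "follow"
    if not query:
        return "start"
    return None


for url in (
    "http://dns.marnet.net.mk/registar.php",
    "http://dns.marnet.net.mk/registar.php?bukva=a",
    "http://dns.marnet.net.mk/registar.php?dom=domain.name.mk",
):
    print(classify(url), url)
```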

Installation

git clone git://github.com/gdamjan/marnet-dns.git
cd marnet-dns
export PYTHONUSERBASE=$PWD/env
pip install --user -r requires.txt

Operation

Set the database URL (COUCHDB_URL) in marnet/settings.py, then run:

export PYTHONUSERBASE=$PWD/env
export PATH=$PYTHONUSERBASE/bin:$PATH
scrapy crawl marnet

The first time I ran it, it worked for 30 minutes and created a 261MB ./cache/ folder, which suggests that is the amount of Internet traffic it generated. Since the Marnet site doesn't use ETags or timestamps, each run of the crawler downloads everything again.

The resulting CouchDB database has 16789 documents and is 41MB (with a very recent CouchDB version).

About

A crawler/scraper of the Macedonian DNS registrar web site - Marnet
