MCrawler version 0.01
=====================

MCrawler is a client/server web crawler written in Perl. The crawler
server (Crawler.pl) processes crawl requests stored in a PostgreSQL
database, and a command-line client (MClient.pl) is used to create,
configure, and monitor those requests. Per-request settings such as
depth of search, refresh rate, allowed content types and user agent
have defaults and can be changed with client commands; database
defaults live in MCrawler_config.yml.

INSTALLATION

To install this module type the following:

   perl Makefile.PL              --> creates the Makefile
   make                          --> automatically checks dependencies and installs required packages from CPAN
   make test                     --> no tests are written yet
   make install                  --> skip this while the module is still in development
   
   Creating the database

   createdb -W -O crawler -U crawler crawlerdb
                                 --> crawler is the database user
                                 --> crawlerdb is the database name
                                 --> the default password is 'crawler'

   All default database values can be changed in MCrawler_config.yml
   (see the example below).

   psql -f <root_dir>/MCrawler_DB.sql -U crawler -d crawlerdb
                                 --> creates the database tables

   Create roles for the database clients:
   log in to psql with superuser permissions and run
   CREATE ROLE client_<client_id> WITH LOGIN PASSWORD '<client_database_password>';
   example: CREATE ROLE client_8989 WITH LOGIN PASSWORD 'secret';
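
   As an illustration of how those defaults might be read, here is a
   minimal Perl sketch assuming MCrawler_config.yml holds flat keys
   named db_name, db_user and db_password (the real key names may
   differ; check the file itself):

      # Illustrative only: the config key names are assumptions.
      use strict;
      use warnings;
      use YAML qw(LoadFile);
      use DBI;

      my $cfg = LoadFile('MCrawler_config.yml');
      my $dbh = DBI->connect(
          "dbi:Pg:dbname=$cfg->{db_name}",   # e.g. crawlerdb
          $cfg->{db_user},                   # e.g. crawler
          $cfg->{db_password},               # default 'crawler'
          { RaiseError => 1 },
      );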
   
   Starting the Crawler and Server 
   
   cd <Root_directory>
   perl Crawler.pl
   
   Starting Client
   
   perl MClient.pl
   
   Client command format:
   Making a new request          --> new_request
   Checking messages from server --> check_inbox
   Queueing a URL                --> queue_url <request_id> <url>
   Queueing a file with URLs     --> queue_file <request_id> <file_location>
   Committing a request          --> commit_request <request_id>
   Checking request status       --> check_status <request_id>
   Dequeueing a request          --> dequeue_request <request_id>
   Other settings                --> downloads_type <request_id> <user_value>     (not used)
                                 --> depth_of_search <request_id> <user_value>    (default: 3)
                                 --> refresh_rate <request_id> <user_value>       (default: 60*60*24)
                                 --> allowed_content <request_id> <user_value>    (default: html|txt)
                                 --> user_agent <request_id> <user_value>         (default: MCrawler bot (http://cyber.law.harvard.edu))
   
   The seeds.txt file contains newline-separated URLs (one URL per line).
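   For example, a seeds file could contain (URLs taken from the
   examples in this README):

      http://wikipedia.org
      http://cyber.law.harvard.edu
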
DEPENDENCIES

This module requires these other modules and libraries:

	These modules are required in order to run Makefile.PL:
	
	-> ExtUtils::MakeMaker
	-> ExtUtils::AutoInstall
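
	The actual Makefile.PL ships with the distribution; as a rough
	sketch of how these two modules are typically combined (the
	prerequisite list below is illustrative, not MCrawler's real one):

	   # Illustrative sketch, not the distribution's Makefile.PL.
	   use strict;
	   use warnings;
	   use ExtUtils::MakeMaker;
	   use ExtUtils::AutoInstall (
	       -core => [               # fetched from CPAN if missing
	           'YAML'    => 0,      # assumed: MCrawler_config.yml is YAML
	           'DBI'     => 0,      # assumed: PostgreSQL access
	           'DBD::Pg' => 0,
	       ],
	   );

	   WriteMakefile(
	       NAME    => 'MCrawler',
	       VERSION => '0.01',
	   );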
	
SAMPLE INPUT/OUTPUT
The admin creates a role by logging in to the crawler database with psql:
CREATE ROLE client_8989 WITH LOGIN PASSWORD 'client_password';
Change the server port (default is 6666) and the client id in MClient.pl, and pass the script to the client.
The client starts a request by entering new_request. Here is a series of standard commands:

--> initiating a new request
input --> new_request
input --> check_inbox
output --> you(8989) made new request 23

--> sending input to server
input --> queue_url 23 http://wikipedia.org
input --> queue_file 23 ./seeds.txt

--> configuring the request
input --> depth_of_search 23 10
input --> refresh_rate 23 7200
input --> user_agent 23 MCrawler-bot/1.0 (http://cyber.law.harvard.edu)

--> committing the request 
input --> commit_request 23

--> checking status
input --> check_status 23
output --> please check your request status on the view_8989_23 table.
           Out of 1 URLs, 0 URLs are completed and the remaining are in processing.

Next, the client logs in to the PostgreSQL server with the password given by the admin
and has only SELECT permission on the view_8989_23 view.
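
For example, the client can inspect the view directly with psql:

   psql -U client_8989 -d crawlerdb -c "SELECT * FROM view_8989_23;"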

--> dequeueing
input --> dequeue_request 23
dequeue_request removes the request_id from the queue, stops any crawling for that
request_id, and removes any data stored in the database for it.


If the client has its own module to contact the server directly, the syntax of the
communication is printed after every command is executed.
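
As a rough sketch, assuming the server accepts the same newline-terminated command
strings over its TCP port (6666 by default; confirm the exact wire format against
what MClient.pl prints), a direct request might look like:

   # Sketch only: the wire format is an assumption, check MClient.pl's output.
   use strict;
   use warnings;
   use IO::Socket::INET;

   my $sock = IO::Socket::INET->new(
       PeerAddr => 'localhost',
       PeerPort => 6666,             # default server port mentioned above
       Proto    => 'tcp',
   ) or die "cannot connect to the crawler server: $!";

   print $sock "check_status 23\n";  # assumed: same syntax as the client commands
   print while <$sock>;              # print whatever the server replies
   close $sock;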

COPYRIGHT AND LICENCE

Copyright (C) 2010 by aditya

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.10.1 or,
at your option, any later version of Perl 5 you may have available.

