- K H M Burhan Uddin (ID: 2307264)
- Mubashar Ali (ID: 2307351)
- **Distributed Crawler**: https://github.com/brhn-me/distributed-crawler
- **Distributed Crawler Registry**: https://github.com/brhn-me/distributed-crawler-registry
The Distributed Web Crawler project is designed to harness the collective processing capabilities of multiple devices for efficient webpage retrieval from the Internet. As an essential tool for search engines and various sectors including Natural Language Processing (NLP), this crawler facilitates extensive data collection across the web.
Operating within a distributed framework, the project tackles key challenges associated with web crawling, such as resource discovery, node heartbeat monitoring, leader election, message queuing, distributed memory management, and fault tolerance.
The architecture of this system builds upon conventional crawler designs by incorporating advanced mechanisms for URL management, processing of initial seed URLs, webpage fetching, HTML parsing for data extraction, and URL normalization to ensure data consistency.
This project focuses on overcoming the challenges of parallel processing and distributed systems to significantly enhance the speed and scope of web crawling. It achieves efficient, scalable, and comprehensive data extraction, surpassing the limitations of single-device operations or vertical scaling strategies.
The architecture employs a master-worker model, with key components outlined below:
- Registry (Master Node): Acts as the central management unit for resource discovery, maintaining an up-to-date record of active nodes through a heartbeat mechanism. The registry is pivotal for leader election, promptly identifying and excluding non-responsive nodes based on heartbeat timeouts. If the master node fails, the registry facilitates the re-election process to ensure uninterrupted operation. (A heartbeat-tracking sketch follows this list.)
- Worker Nodes (Crawlers): These nodes perform the actual web crawling. To support distributed operation, the system employs Redis as a distributed cache, serving as a queue for new URLs discovered by the crawler nodes. Once a webpage is crawled, its URL is marked as visited in this shared store. (A sketch of the shared queue and visited-URL bookkeeping follows this list.)
- Distributed Message Queue using Redis: Central to the system's scalability and communication efficiency is a distributed message queue powered by Redis. This queue orchestrates the distribution of tasks among worker nodes, ensuring load balancing and enabling asynchronous processing. The Redis message queue significantly enhances the system's ability to manage large volumes of URLs and data exchange between nodes, providing a robust mechanism for task allocation and status updates.
- Data Structure: Crawler nodes utilize a trie data structure for storing retrieved webpages (a sketch follows this list).
- Communication: Most inter-node communication is carried out via remote procedure calls.
- Payload: Communication payloads are sent in JSON format (an illustrative example follows this list).
- Experimental fault tolerance: To enhance fault tolerance, crawled files are shared among all crawler nodes, creating redundant copies that safeguard against data loss (a sketch of the idea follows this list).
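The sketches below illustrate how several of the components above could be realized. They are simplified examples under stated assumptions, not the project's actual implementation. First, a minimal heartbeat monitor of the kind the Registry relies on to exclude non-responsive nodes; the timeout value, class name, and method names are assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: tracks the last heartbeat per node and evicts nodes
// whose heartbeat is older than a configurable timeout.
public class HeartbeatMonitor {

    private static final Duration TIMEOUT = Duration.ofSeconds(30); // assumed timeout

    private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a node reports in.
    public void recordHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, Instant.now());
    }

    // Periodically invoked (e.g., by a scheduled task) to drop dead nodes.
    public void evictStaleNodes() {
        Instant cutoff = Instant.now().minus(TIMEOUT);
        lastHeartbeat.entrySet().removeIf(e -> e.getValue().isBefore(cutoff));
    }

    public boolean isAlive(String nodeId) {
        Instant seen = lastHeartbeat.get(nodeId);
        return seen != null && seen.isAfter(Instant.now().minus(TIMEOUT));
    }
}
```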
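Next, a sketch of the shared URL queue and visited-URL bookkeeping on top of Redis, using Spring Data Redis's `StringRedisTemplate`; the key names `crawler:url-queue` and `crawler:visited` are assumptions for illustration:

```java
import java.time.Duration;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

// Illustrative sketch: a shared URL queue (Redis list) plus visited-URL
// bookkeeping (Redis set), so all crawler nodes cooperate on the same frontier.
// Key names are assumptions, not taken from the actual codebase.
@Component
public class UrlFrontier {

    private static final String QUEUE_KEY = "crawler:url-queue";
    private static final String VISITED_KEY = "crawler:visited";

    private final StringRedisTemplate redis;

    public UrlFrontier(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Enqueue a newly discovered URL unless some node has already visited it.
    public void offer(String url) {
        Boolean visited = redis.opsForSet().isMember(VISITED_KEY, url);
        if (!Boolean.TRUE.equals(visited)) {
            redis.opsForList().leftPush(QUEUE_KEY, url);
        }
    }

    // Block for up to the given timeout waiting for the next URL to crawl.
    public String next(Duration timeout) {
        return redis.opsForList().rightPop(QUEUE_KEY, timeout);
    }

    // Mark a URL as visited once its page has been fetched and stored.
    public void markVisited(String url) {
        redis.opsForSet().add(VISITED_KEY, url);
    }
}
```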
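The report does not spell out the exact trie layout used for storing retrieved webpages; one plausible shape is a character-level trie keyed by URL that maps each stored page to its location on disk, as in this hypothetical sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a character-level trie keyed by URL, mapping each
// crawled URL to the path of the stored page. Shared prefixes (scheme, domain)
// are stored only once.
public class UrlTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String storedPath; // non-null only if a page is stored for this exact URL
    }

    private final Node root = new Node();

    public void put(String url, String storedPath) {
        Node node = root;
        for (char c : url.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.storedPath = storedPath;
    }

    // Returns the stored path for the URL, or null if it has not been stored.
    public String get(String url) {
        Node node = root;
        for (char c : url.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node.storedPath;
    }
}
```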
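Inter-node calls carry JSON payloads over REST. As an illustration only, a crawler node registering itself with the registry might look like the following; the `/api/nodes/register` path and the payload fields are assumptions, not the project's actual API:

```java
import org.springframework.web.client.RestTemplate;

// Illustrative sketch of a JSON payload exchanged over REST; the endpoint path
// and field names are assumptions, not the project's actual API.
public class RegistrationClient {

    // Serialized to/from JSON by Spring's Jackson message converter.
    public static class RegistrationRequest {
        public String nodeId;
        public String host;
        public int port;
    }

    public static class RegistrationResponse {
        public boolean accepted;
        public String leaderId;
    }

    private final RestTemplate rest = new RestTemplate();

    public RegistrationResponse register(String registryUrl, RegistrationRequest request) {
        return rest.postForObject(registryUrl + "/api/nodes/register",
                request, RegistrationResponse.class);
    }
}
```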
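For the experimental fault-tolerance feature, one simple way to create the redundant copies is to push each crawled file to the other known nodes; the endpoint and payload below are hypothetical:

```java
import java.util.List;

import org.springframework.web.client.RestClientException;
import org.springframework.web.client.RestTemplate;

// Illustrative sketch: after a page is stored locally, a copy is offered to
// every peer node so the data survives the loss of a single crawler.
// The "/api/replica" endpoint and the payload shape are assumptions.
public class ReplicationService {

    public static class CrawledPage {
        public String url;
        public String html;
    }

    private final RestTemplate rest = new RestTemplate();

    public void replicate(CrawledPage page, List<String> peerBaseUrls) {
        for (String peer : peerBaseUrls) {
            try {
                rest.postForLocation(peer + "/api/replica", page);
            } catch (RestClientException e) {
                // A failed peer is tolerated; the local copy remains authoritative.
            }
        }
    }
}
```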
- Spring Framework: Facilitates node communication within the distributed network.
- Redis: Manages URL queues and acts as the distributed system's message queue.
- Java: The primary programming language for system development.
- REST: The communication protocol between different components.
- HTML/JavaScript: Powers the user interface for resource discovery.
- Docker: Enables deployment in a virtualized container environment for consistency across different platforms.
This setup guide walks you through deploying the `distributed-crawler-registry` (Registry) and the `distributed-crawler` (Crawler Nodes). The Registry must be started first, as it orchestrates the crawler nodes that perform the actual web crawling.
- Java 11+: Required to run the Java applications.
- Maven: Necessary for building the projects.
- Docker: Used for running Redis easily.
- Redis: Acts as the message queue and URL store.
The following sections cover the initial setup of Java, Redis, and the other components required to run the Distributed Web Crawler project.
Before diving into the project setup, ensure all necessary tools and technologies are correctly installed and configured.
- Download Java: Visit the official Oracle Java website or use OpenJDK to download the Java Development Kit (JDK) for your operating system. Java 11 or newer is recommended for this project.
- Install Java: Follow the installation instructions for your operating system. Typically, this involves running the installer and following the on-screen prompts.
- Set Environment Variables:
  - On Windows, add the JDK's `bin` directory to the `Path` environment variable.
  - On macOS/Linux, add the export commands to your shell configuration file (e.g., `.bashrc` or `.zshrc`):

    ```
    export JAVA_HOME=/path/to/your/jdk
    export PATH=$JAVA_HOME/bin:$PATH
    ```
- Verify the installation by opening a terminal or command prompt and typing:

  ```
  java -version
  ```

  You should see the version of Java you installed.
- Docker Installation (for running Redis in a Docker container):
  - Download and install Docker from the official Docker website.
  - After installation, run Docker and ensure it's working correctly by executing `docker run hello-world` in a terminal or command prompt.
- Running Redis using Docker:
  - Open a terminal or command prompt.
  - Execute the following command to start a Redis container:

    ```
    docker run --name redis -p 6379:6379 -d redis
    ```

    This command downloads the Redis image if it's not already present and starts a Redis server. `-p 6379:6379` exposes Redis on the default port `6379` on your host machine.
  - To verify Redis is running, use:

    ```
    docker ps
    ```

    You should see the Redis container listed.
- Download Maven: Go to the Apache Maven Project website and download the latest version of Maven.
- Install Maven:
  - Unzip the distribution archive in a directory of your choice.
  - Add the `bin` directory of the extracted directory to the `PATH` environment variable.
  - Confirm the installation by running `mvn -v` in a terminal or command prompt. You should see the Maven version along with the Java version.
- Clone the Project: Clone the project repository to your local machine.

  ```
  git clone <repository-url>
  ```
- Start Redis: Use Docker to run Redis. If you already have Redis installed, you can skip this step.

  ```
  docker run --name redis -p 6379:6379 -d redis
  ```
- Navigate to the `distributed-crawler-registry` project directory.
- Build the Project using Maven:

  ```
  mvn clean install
  ```
- Run the Registry:

  ```
  java -jar target/distributed-crawler-registry-<version>.jar
  ```

  Replace `<version>` with the version number of your project.
- Navigate to the `distributed-crawler` directory.
- Configure `application.properties`: Edit the `src/main/resources/application.properties` file to specify the configuration:

  ```
  registry.server=http://<registry-ip>:8085
  server.port=9091
  spring.data.redis.url=redis://<redis-ip>:6379
  data.dir=/path/to/crawl/data
  ```

  - `registry.server`: URL of the registry server.
  - `server.port`: Port for the crawler service. Change this when running multiple instances on the same machine.
  - `spring.data.redis.url`: Redis connection URL.
  - `data.dir`: Directory for storing crawled data.

  A sketch of how these properties might be consumed on the crawler side follows these steps.
- Build the Crawler: Every time you change `application.properties`, you must rebuild the project.

  ```
  mvn clean install
  ```
- Run the Crawler Node:

  ```
  java -jar target/distributed-crawler-<version>.jar
  ```

  To run multiple instances on the same machine, modify `server.port` for each instance and follow the build and run steps again.
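On the crawler side, the `application.properties` entries above would typically be consumed through Spring's property injection, along the lines of the hypothetical sketch below; the class and accessor names are illustrative, not the project's actual code (note that `server.port` and `spring.data.redis.url` are handled by Spring Boot itself):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical sketch of how the crawler might consume its custom settings via
// Spring property injection; the class and field names are illustrative.
@Component
public class CrawlerSettings {

    @Value("${registry.server}")
    private String registryServer;

    @Value("${data.dir}")
    private String dataDir;

    public String getRegistryServer() {
        return registryServer;
    }

    public String getDataDir() {
        return dataDir;
    }
}
```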
- Check Registry Logs: Ensure it logs connections from the crawler nodes.
- Monitor Redis: Use the Redis CLI or GUI tools to watch the message queues and confirm URLs are being processed; a programmatic alternative is sketched after this list.
- Crawler Operation: Confirm that the crawler nodes are fetching and analyzing web pages as expected.
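If you prefer a programmatic check over the Redis CLI, a small standalone snippet such as the one below can report the queue length. It uses the Jedis client purely as an example, and the key name `crawler:url-queue` is an assumption, so substitute whatever key your deployment actually writes to:

```java
import redis.clients.jedis.Jedis;

// Small standalone check: prints the current length of the URL queue.
// Host, port, and key name are assumptions; adjust them to your deployment.
public class QueueLengthCheck {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long pending = jedis.llen("crawler:url-queue");
            System.out.println("URLs waiting in queue: " + pending);
        }
    }
}
```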
- Ensure all IP addresses and ports in `application.properties` are correct and match your environment setup.
- If running multiple crawler nodes on the same machine, remember to assign unique ports to each.
- Verify that the specified `data.dir` path exists and is writable by the crawler application.
The `seeds.txt` file in the `resources` folder of your `distributed-crawler` project plays a critical role in initializing the crawling process by providing a set of start URLs (seeds). To adapt the crawler to new domains or focus areas, you may need to update this file with new seed URLs. Here's how to do it:
- Locate the File: Navigate to the `src/main/resources` directory within the `distributed-crawler` project. Here, you'll find the `seeds.txt` file.
- Edit the File: Open `seeds.txt` in a text editor of your choice. You'll see a list of URLs, each on a new line, like so:

  ```
  https://en.wikipedia.org/wiki/Main_Page
  https://www.bbc.com
  ```
- Add or Remove URLs: Modify the list as needed:
  - To add a new website, simply append its URL on a new line at the end of the file.
  - To remove an existing website, delete its corresponding line.
For example, to add `https://www.example.com`, your `seeds.txt` would look like this:
```
https://en.wikipedia.org/wiki/Main_Page
https://www.bbc.com
https://www.example.com
```
- Save Your Changes: After modifying the file, save your changes and close the editor.
To ensure the updated `seeds.txt` takes effect, you need to rebuild the `distributed-crawler` project:
```
mvn clean install
```
This command compiles the project and packages it, including the updated `seeds.txt` within the build artifact. When you next run your crawler node, it will use the new seed URLs for the crawling process.
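For reference, loading the packaged `seeds.txt` from the classpath typically looks something like the sketch below; the class name is hypothetical and the project's actual loading code may differ:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Illustrative sketch: reads seeds.txt from the classpath (it is packaged into
// the jar by the Maven build), one URL per line, skipping blank lines.
public class SeedLoader {

    public static List<String> loadSeeds() throws IOException {
        List<String> seeds = new ArrayList<>();
        try (InputStream in = Objects.requireNonNull(
                     SeedLoader.class.getResourceAsStream("/seeds.txt"),
                     "seeds.txt not found on classpath");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String url = line.trim();
                if (!url.isEmpty()) {
                    seeds.add(url);
                }
            }
        }
        return seeds;
    }
}
```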
- URL Format: Ensure all URLs added are correctly formatted and accessible. Invalid or inaccessible URLs may cause errors during the crawl.
- Seed Diversity: For a broader crawl, include seeds from different domains and hierarchies.
- Regular Updates: Periodically review and update your seeds to align with the crawler's goals and to adapt to changes on the web.
- Seed URLs: A limited number of URLs used to initiate the crawl, ranging from 5 to 20, representing focused and broad crawling strategies.
- Queue Size: The potential to handle an expansive queue, with sizes exceeding 100,000 URLs, demonstrating the system's scalability.
- Visited URLs: The total count of unique URLs the system successfully crawls and processes.
- Number of URLs Fetched: The total number of URLs retrieved by the crawler.
This scenario explores the system's depth of crawl starting from a minimal set of highly targeted seed URLs.
- Configuration: 4-5 crawler nodes, each supplied with 5 seed URLs, focusing on specific domains or topics.
- Results:
- Queue Size: Rapidly expanded to over 100,000 URLs.
- Visited URLs: Approximately 20,000 unique URLs were crawled, showing the system's depth of exploration from a few seeds.
- Total Crawl Time: About 5 hours.
This scenario tests the system's ability to handle a broad exploration starting from a wider array of seed URLs.
- Configuration: 4-5 crawler nodes, each with 20 diverse seed URLs spread across various domains.
- Results:
- Queue Size: Reached 300,000+ URLs.
- Visited URLs: Managed to crawl around 30,000 unique URLs.
- Total Crawl Time: Approximately 5 hours.