GitHub - batikanturkmen/link-crawler: crawls a given link and builds a graph of that link and its adjacent links within the same domain

Link Crawler

The aim of this project is to extract and graph all the links that can be reached through the given link.

Requirements

Docker

Run

The project can be started with the docker-compose up command.

To terminate the project, use the docker-compose down command. If you change any Python files, you have to rebuild the Docker images without using the cache. The docker-compose build --no-cache crawler-master crawler-worker command can be used for a no-cache rebuild.

Project outline

Master Node

The master node is responsible for writing the links to be crawled to a Kafka topic. It also processes the responses of the worker nodes and writes them to the graph database.
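The master's loop can be sketched as follows. This is a minimal in-memory model, not the code in master.py: a deque stands in for the Kafka topic of links to crawl, a set of edges stands in for the graph database, and the `crawl` callable stands in for the round trip through a worker. The names `run_master`, `crawl`, and `max_pages`, and the duplicate filter, are assumptions made for illustration.

```python
from collections import deque


def run_master(seed, crawl, max_pages=100):
    """Crawl outward from `seed` and return the set of (source, target) edges.

    `crawl` is a hypothetical callable that returns the links found on a page;
    in the real project this step goes through Kafka and the worker nodes.
    """
    pending = deque([seed])  # stands in for the "links to crawl" topic
    seen = {seed}            # assumed duplicate filter, so no link is queued twice
    edges = set()            # stands in for the graph database
    crawled = 0

    while pending and crawled < max_pages:
        link = pending.popleft()
        crawled += 1
        for target in crawl(link):
            edges.add((link, target))
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return edges
```

Passing a dictionary lookup as `crawl` is enough to try the loop without Kafka or a network connection.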

Worker Node(s)

Worker nodes are deliberately simple, and their number can be increased to achieve better performance. These nodes read the links to be crawled from a Kafka topic, crawl them, and write their findings to a different Kafka topic.
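The crawl step of a worker might look roughly like the sketch below, which extracts the links of a page and keeps only those in the same domain. It uses only the Python standard library and leaves out the Kafka reads and writes and the HTTP download. `LinkCollector` and `same_domain_links` are hypothetical names, and the actual worker.py may use different libraries and filtering rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(base_url, html):
    """Return the sorted absolute links in `html` that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Skip mailto:, javascript: and links that point to other hosts.
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            found.add(absolute)
    return sorted(found)
```

In this sketch "same domain" means an exact host match, so a subdomain such as blog.example.com would not count as part of www.example.com.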

Communication

Apache Kafka is used as the message broker. The Kafka cluster and its related components can be monitored and managed with the Landoop UI at http://localhost:3030/. This UI lets us inspect topics, schemas and connectors.

Kafka consumer groups let us distribute messages among the workers in a round-robin fashion. To support multiple workers, however, the topic must have multiple partitions (3 in our case). In Kafka's architecture, a partition can be consumed by only one consumer within a consumer group, while a single consumer can consume more than one partition.
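The effect of the partition count on the number of useful workers can be illustrated with a small model. This is a simplified simulation of round-robin assignment, not Kafka's actual assignor code, and `assign_round_robin` is a hypothetical helper.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers one by one, like cards.

    Each partition ends up with exactly one consumer; a consumer may get
    several partitions, or none if there are more consumers than partitions.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


partitions = [0, 1, 2]  # the topic has 3 partitions in this project
two_workers = assign_round_robin(partitions, ["worker-0", "worker-1"])
four_workers = assign_round_robin(
    partitions, ["worker-0", "worker-1", "worker-2", "worker-3"]
)
```

With two workers, one of them handles two partitions. With four workers, the fourth receives no partition and stays idle, which is why adding workers beyond the partition count does not help.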

Storage

To store and present the connections between links, this project uses the neo4j graph database. neo4j ships with a web interface that helps with both querying and visualizing the stored data.

[Figure: example graph output]

Database usage

You can access the neo4j database UI at http://localhost:7474/ (username: neo4j, password: batikan) and use the following query to see the whole graph:

MATCH(node)
RETURN node

You can also interact with the graph nodes and see how they are connected to each other, as follows.

The query used for the image above is:

MATCH(s:URL{link: 'https://www.afiniti.com/'})
RETURN s

After running the query, click the node and choose expand to see its connected nodes.
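The same lookup could also be issued from Python. The sketch below assumes the official neo4j Python driver and a Bolt endpoint at bolt://localhost:7687, neither of which is stated in this README. `find_url_nodes` is a hypothetical helper, and the query passes the link as a parameter instead of embedding it in the query text.

```python
def find_url_nodes(session, link):
    """Return the URL nodes whose `link` property equals `link`.

    `session` is expected to behave like a neo4j driver session, that is,
    to offer run(query, **parameters) and yield records indexable by key.
    """
    query = "MATCH (s:URL {link: $link}) RETURN s"
    result = session.run(query, link=link)
    return [record["s"] for record in result]


# Usage sketch (assumes `pip install neo4j` and an assumed Bolt URL):
#
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "batikan"))
#   with driver.session() as session:
#       print(find_url_nodes(session, "https://www.afiniti.com/"))
#   driver.close()
```

Taking the session as an argument keeps the function free of connection details, so it can be exercised with a stand-in object when no database is running.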

Future Works

Add a multi-stage build to reduce Docker image sizes.

