Simple producer/consumer web link extractor


Installation

Needed:

  • Python 3.5
  • Redis
  • Supervisord

Preferably inside a virtualenv:

pip install pip-tools (once)

pip-sync requirements*.txt (keeps the PyPI dependencies up to date)

Configuration

Custom settings go in extractor/settings_local.py, but none should be needed.
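
If you do override anything, settings_local.py is a plain Python module. A minimal sketch, assuming the project reads its Redis/Celery connection from this file; the setting names below are illustrative guesses, not confirmed names from the code:

# extractor/settings_local.py -- hypothetical overrides; the setting names
# are assumptions, not taken from the project's actual settings module.
BROKER_URL = 'redis://localhost:6379/1'  # assumed Celery broker override
OUT_DIR = 'out'                          # assumed output directory override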

Usage

Running supervisord starts Supervisord with the background services (Redis and the Celery workers, i.e. the "consumers"). They can then be controlled with supervisorctl. Logs are stored in the log directory.

./app.py is the "producer". It expects a list of URLs on standard input; the input is parsed for URLs, so you can also feed it HTML: ./app.py < index.html.
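
A minimal sketch of what the producer side might look like; the regex-based URL extraction and the names used here are assumptions, not the project's actual app.py:

# Illustrative producer sketch -- not the project's actual app.py.
import re
import sys

# Assumed way of pulling URLs out of arbitrary input (plain lists or HTML).
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def main():
    for url in URL_RE.findall(sys.stdin.read()):
        # In the real app each URL would be handed to a consumer,
        # e.g. by enqueueing a Celery task such as extract_links.delay(url).
        print('queueing', url)

if __name__ == '__main__':
    main()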

For each input URL, a consumer downloads the referenced web page, parses it for absolute URLs, and saves them as a JSON file in the out directory. The output file name is the MD5 hash of the input URL.
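
A minimal sketch of such a consumer task, assuming Celery with a Redis broker and the requests library; the task name, broker URL, and link-extraction regex are illustrative assumptions, not the project's actual code:

# Illustrative consumer sketch -- names, broker URL, and parsing approach
# are assumptions, not the project's actual code.
import hashlib
import json
import os
import re

import requests
from celery import Celery

app = Celery('extractor', broker='redis://localhost:6379/0')  # assumed broker URL
ABSOLUTE_URL_RE = re.compile(r'https?://[^\s"\'<>]+')

@app.task
def extract_links(url):
    """Download url, collect absolute URLs, and write them to out/<md5>.json."""
    html = requests.get(url, timeout=10).text
    links = sorted(set(ABSOLUTE_URL_RE.findall(html)))
    out_name = hashlib.md5(url.encode('utf-8')).hexdigest() + '.json'
    with open(os.path.join('out', out_name), 'w') as f:
        json.dump({'url': url, 'links': links, 'version': '0.1.0'}, f, indent=2)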

Example

$ supervisord  # if not already done before
$ ./app.py
http://example.com
Ctrl+D
$ jq . < out/a9b9f04336ce0181a08e774e01113b31.json
{
  "url": "http://example.com",
  "links": [
    "http://www.iana.org/domains/example"
  ],
  "version": "0.1.0"
}

Testing

./test.sh (also generates a coverage report)
