Readme updated.
garncarz committed Dec 16, 2016
1 parent 495adf7 commit 00edbf7
Showing 1 changed file with 33 additions and 3 deletions.
36 changes: 33 additions & 3 deletions README.md
@@ -1,3 +1,5 @@
# Simple producer/consumer web link extractor

[![Build Status](https://travis-ci.org/garncarz/link-extractor.svg?branch=master)](https://travis-ci.org/garncarz/link-extractor)
[![Coverage Status](https://coveralls.io/repos/github/garncarz/link-extractor/badge.svg?branch=master)](https://coveralls.io/github/garncarz/link-extractor?branch=master)

@@ -7,6 +9,7 @@
Needed:
- Python 3.5
- Redis
- Supervisord

Preferably under `virtualenv`:

@@ -23,9 +26,36 @@ Customized settings are expected in `extractor/settings_local.py`.

## Usage

`redis-server redis.conf`

`celery -A extractor worker -l info`
`supervisord` starts Supervisord together with the background services
(Redis and the Celery workers, i.e. the "consumers").
The services can then be controlled with `supervisorctl`.
Logs are stored in the `log` directory.

`./app.py` is the "producer". It expects a list of URLs on standard input.
The input is parsed for URLs, so HTML works as input too: `./app.py < index.html`.
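
How the producer finds URLs in arbitrary input is not spelled out here. A minimal sketch of one way to do it, assuming a simple regular expression is good enough (`URL_PATTERN` and `extract_urls` are hypothetical names, not taken from `app.py`):

```python
import re

# Assumption: a URL is anything starting with http:// or https:// and running
# up to the next whitespace, quote, or angle bracket. The real producer may
# use a different rule.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(text):
    """Return all http(s) URLs found in the text, in order of appearance."""
    return URL_PATTERN.findall(text)
```

Calling `extract_urls(sys.stdin.read())` would then handle both a plain list of URLs and an HTML page.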

For each input URL, a consumer downloads the referenced web page,
extracts the absolute URLs it contains,
and saves them in a JSON file in the `out` directory.
The output file name is the MD5 hash of the input URL.
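
The consumer step described above can be sketched roughly as follows. This is only an illustration using the standard library: `AbsoluteLinkParser`, `build_record` and `output_path` are hypothetical names, the download itself is left out, and it is an assumption that the URL is hashed as UTF-8 without any normalization.

```python
import hashlib
from html.parser import HTMLParser


class AbsoluteLinkParser(HTMLParser):
    """Collects href values of <a> tags that are absolute http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links (e.g. "/about") are skipped on purpose.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def build_record(url, html):
    """Build the record for one page: the input URL plus the links found."""
    parser = AbsoluteLinkParser()
    parser.feed(html)
    return {"url": url, "links": parser.links}


def output_path(url, out_dir="out"):
    """Name the output file after the MD5 hex digest of the input URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "{}/{}.json".format(out_dir, digest)
```

The record built here omits the `version` field that appears in the real output.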


## Example

```sh
$ supervisord # if not already done before
$ ./app.py
http://example.com
Ctrl+D
$ jq < out/a9b9f04336ce0181a08e774e01113b31.json
{
  "url": "http://example.com",
  "links": [
    "http://www.iana.org/domains/example"
  ],
  "version": "0.1.0"
}
```
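
The result files can also be read from Python instead of `jq`. A small sketch, assuming the layout shown in the example output (`links_from_result` is a hypothetical helper, not part of the project):

```python
import json


def links_from_result(json_text):
    """Return (input URL, list of extracted links) from one result file's text."""
    record = json.loads(json_text)
    return record["url"], record["links"]
```

It takes the text of a file from the `out` directory, for example the contents read via `open(path).read()`.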


## Testing