Merge pull request #32 from oltarasenko/documentation_improvements
Migrate the static documentation to ex_doc
Showing 13 changed files with 947 additions and 1,071 deletions.

# Crawly

[![Build Status](https://travis-ci.com/oltarasenko/crawly.svg?branch=master)](https://travis-ci.com/oltarasenko/crawly)
[![Coverage Status](https://coveralls.io/repos/github/oltarasenko/crawly/badge.svg?branch=coveralls)](https://coveralls.io/github/oltarasenko/crawly?branch=coveralls)

## Overview

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

## Requirements

1. Elixir "~> 1.7"
2. Works on Linux, Windows, OS X and BSD

## Installation

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your mix.exs file:
```elixir
def deps do
  [{:crawly, "~> 0.6.0"}]
end
```
3. Fetch Crawly: `mix deps.get`

## Quickstart

In this section we will show how to bootstrap a small project and set up Crawly for proper data extraction.

1. Create a new Elixir project: `mix new crawly_example --sup`
2. Add Crawly to the dependencies (mix.exs file):
```elixir
defp deps do
  [
    {:crawly, "~> 0.6.0"}
  ]
end
```
3. Fetch dependencies: `mix deps.get`
4. Define the crawling rules (a spider). The Floki selectors used here can be tried out interactively; see the sketch after this list.
```elixir
cat > lib/crawly_example/esl_spider.ex << EOF
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
EOF
```
5. Configure Crawly. By default Crawly does not require any configuration, but you will need one to fine-tune the crawl (a sketch of a custom item pipeline follows this list):
```elixir
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder,
    Crawly.Pipelines.WriteToFile
  ]
```
6. Start the crawl:
   - `iex -S mix`
   - `Crawly.Engine.start_spider(EslSpider)`
7. Results can be seen with: `cat /tmp/EslSpider.csv`
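
The selectors from step 4 can be checked before running a full crawl. A minimal sketch, assuming `HTTPoison` and `Floki` are available in the project (both come in through Crawly's dependency tree at this version, but treat that as an assumption):

```elixir
# Fetch the start page and run the same Floki queries that parse_item/1 uses.
# The CSS selectors are tied to the blog's markup at the time of writing.
{:ok, response} = HTTPoison.get("https://www.erlang-solutions.com/blog.html")

response.body
|> Floki.find("a.more")
|> Floki.attribute("href")
|> Enum.take(5)
```

If the returned hrefs look right, `parse_item/1` will turn them into follow-up requests via `Crawly.Utils.requests_from_urls/1`.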
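
The `pipelines:` list from step 5 is also the extension point for custom post-processing. Below is a minimal sketch of a hypothetical pipeline; it assumes the `Crawly.Pipeline` behaviour at this version expects a `run/2` callback that receives the scraped item and the shared state and returns `{item, state}` (or `{false, state}` to drop the item); check the HexDocs for the exact contract:

```elixir
# Hypothetical pipeline that trims whitespace from the scraped title before it
# reaches the CSV encoder. The run/2 contract is an assumption, as noted above.
defmodule CrawlyExample.TrimTitle do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state) do
    {Map.update(item, :title, "", &String.trim/1), state}
  end
end
```

A module like this would be listed in `pipelines:` before `Crawly.Pipelines.CSVEncoder`.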

## Documentation

Documentation is available online at https://oltarasenko.github.io/crawly/#/ and in the docs directory, and on HexDocs:

- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
- [Quickstart](https://hexdocs.pm/crawly/quickstart.html)
- [Tutorial](https://hexdocs.pm/crawly/tutorial.html)

## Tutorial

The Crawly tutorial: https://oltarasenko.github.io/crawly/#/?id=crawly-tutorial

## Roadmap

1. [ ] Pluggable HTTP client
2. [ ] Retries support
3. [ ] Cookies support
4. [ ] XPath support
5. [ ] Project generators (spiders)
6. [ ] UI for jobs management

## Articles

1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html

## Example projects

1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
2. E-commerce websites: https://github.com/oltarasenko/products-advisor
3. Car shops: https://github.com/oltarasenko/crawly-cars

## Contributors

We would gladly accept your contributions! Please refer to the `Under The Hood` section on [HexDocs](https://hexdocs.pm/crawly/) for module documentation.