Migrate the static documentation to ex_doc #32

Merged 3 commits on Dec 10, 2019
131 changes: 25 additions & 106 deletions README.md
@@ -1,136 +1,55 @@
# Crawly

[![Build Status](https://travis-ci.com/oltarasenko/crawly.svg?branch=master)](https://travis-ci.com/oltarasenko/crawly)
[![Coverage Status](https://coveralls.io/repos/github/oltarasenko/crawly/badge.svg?branch=coveralls)](https://coveralls.io/github/oltarasenko/crawly?branch=coveralls)
# Overview
## Overview

Crawly is an application framework for crawling web sites and
extracting structured data which can be used for a wide range of
useful applications, like data mining, information processing or
historical archival.

# Requirements
## Requirements

1. Elixir "~> 1.7"
2. Works on Linux, Windows, OS X and BSD

# Installation

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your mix.exs file:
```elixir
def deps do
  [{:crawly, "~> 0.6.0"}]
end
```
3. Fetch crawly: `mix deps.get`

## Installation

Add Crawly to your mix.exs file:

```elixir
def deps do
  [{:crawly, "~> 0.6.0"}]
end
```

# Quickstart

In this section we will show how to bootstrap a small project and set up
Crawly for proper data extraction.

1. Create a new Elixir project: `mix new crawly_example --sup`
2. Add Crawly to the dependencies (mix.exs file):
```elixir
defp deps do
  [
    {:crawly, "~> 0.6.0"}
  ]
end
```
3. Fetch dependencies: `mix deps.get`
4. Define Crawling rules (Spider)
```elixir
cat > lib/crawly_example/esl_spider.ex << EOF
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
EOF
```

5. Configure Crawly:

By default Crawly does not require any configuration, but you will most likely
want to fine-tune your crawls (see the note after step 7 for where this
configuration usually lives):

```elixir
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder,
    Crawly.Pipelines.WriteToFile
  ]
```


6. Start the Crawl:
- `iex -S mix`
- `Crawly.Engine.start_spider(EslSpider)`

7. Results can be seen in: `cat /tmp/EslSpider.csv`
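A note on step 5: the README does not say where the configuration should go. As a sketch, it would typically live in the project's application config; the file path and the `use Mix.Config` line below are assumptions based on standard Elixir project conventions, not something Crawly itself requires:

```elixir
# config/config.exs (assumed location; create the file if your
# generated project does not already have one)
use Mix.Config

# Only a subset of the step 5 options is repeated here for illustration.
config :crawly,
  closespider_timeout: 10,
  closespider_itemcount: 1000
```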
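Putting steps 6 and 7 together, a session could look roughly like the sketch below. `Crawly.Engine.start_spider/1` is taken from the steps above; the `stop_spider/1` call is an assumption about the Engine API, so check the HexDocs for the Crawly version you are using:

```elixir
# In a terminal, from the project root:
#   $ iex -S mix
#
# Then, in the IEx shell:
Crawly.Engine.start_spider(EslSpider)

# Wait for the crawl to finish (or hit closespider_timeout), or stop it
# manually (assumed API, see the Crawly.Engine docs):
Crawly.Engine.stop_spider(EslSpider)

# Back in the terminal, inspect the scraped items:
#   $ cat /tmp/EslSpider.csv
```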


# Documentation

Documentation is available online at
https://oltarasenko.github.io/crawly/#/ and in the docs directory.

# Tutorial

The crawly tutorial: https://oltarasenko.github.io/crawly/#/?id=crawly-tutorial

## Documentation

- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
- [Quickstart](https://hexdocs.pm/crawly/quickstart.html)
- [Tutorial](https://hexdocs.pm/crawly/tutorial.html)

# Roadmap

1. [ ] Cookies support
2. [ ] XPath support
3. [ ] Pluggable HTTP client
4. [ ] Project generators (spiders)
5. [ ] Retries support

## Roadmap

1. [ ] Pluggable HTTP client
2. [ ] Retries support
3. [ ] Cookies support
4. [ ] XPath support
5. [ ] Project generators (spiders)
6. [ ] UI for jobs management

# We are looking for contributors

We would gladly accept your contributions!

# Articles
## Articles
1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html

# Example projects
## Example projects

1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
2. E-commerce websites: https://github.com/oltarasenko/products-advisor
3. Car shops: https://github.com/oltarasenko/crawly-cars

## Contributors

We would gladly accept your contributions! Please refer to the `Under The Hood` section on [HexDocs](https://hexdocs.pm/crawly/) for modules documentation.