diff --git a/README.md b/README.md
index 5ef70acb..ee0310ec 100644
--- a/README.md
+++ b/README.md
@@ -74,15 +74,14 @@ historical archival.
     middlewares: [
       Crawly.Middlewares.DomainFilter,
       Crawly.Middlewares.UniqueRequest,
-      Crawly.Middlewares.UserAgent
+      {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
     ],
     pipelines: [
       {Crawly.Pipelines.Validate, fields: [:url, :title]},
       {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
       Crawly.Pipelines.JSONEncoder,
-      {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
-    ],
-    port: 4001
+      {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp"}
+    ]
   ```
 5. Start the Crawl:
    - `$ iex -S mix`
@@ -108,7 +107,7 @@ You can read more here:

 1. [x] Pluggable HTTP client
 2. [x] Retries support
-3. [ ] Cookies support
+3. [x] Cookies support
 4. [x] XPath support - can be actually done with meeseeks
 5. [ ] Project generators (spiders)
 6. [ ] UI for jobs management
diff --git a/documentation/introduction.md b/documentation/introduction.md
deleted file mode 100644
index 79f51247..00000000
--- a/documentation/introduction.md
+++ /dev/null
@@ -1,147 +0,0 @@
-# Crawly
-
----
-
-Crawly is a web crawling framework, written in Elixir.
-
-## Installation
-
-Crawly requires Elixir v1.7 or higher.
-
-1. Add Crawly to you mix.exs file
-   ```elixir
-   def deps do
-     [
-       {:crawly, "~> 0.8.0"},
-       {:floki, "~> 0.26.0"}
-     ]
-   end
-   ```
-2. Update your dependencies with `mix deps.get`
-
-## Walk-through of an example spider
-
-In order to show you what Crawly brings to the table, we’ll walk you through an example of a Crawly spider using the simplest way to run a spider.
-
-Here’s the code for a spider that scrapes blog posts from the Erlang Solutions blog: https://www.erlang-solutions.com/blog.html, following the pagination:
-
-```elixir
-defmodule Esl do
-  use Crawly.Spider
-
-  @impl Crawly.Spider
-  def base_url(), do: "https://www.erlang-solutions.com"
-
-  def init() do
-    [
-      start_urls: ["https://www.erlang-solutions.com/blog.html"]
-    ]
-  end
-
-  @impl Crawly.Spider
-  def parse_item(response) do
-    # Parse response body to Floki document
-    {:ok, document} = Floki.parse_document(response.body)
-
-    # Getting new urls to follow
-    urls =
-      document
-      |> Floki.find("a.more")
-      |> Floki.attribute("href")
-      |> Enum.uniq()
-
-    # Convert URLs into requests
-    requests =
-      Enum.map(urls, fn url ->
-        url
-        |> build_absolute_url(response.request_url)
-        |> Crawly.Utils.request_from_url()
-      end)
-
-    # Extract item from a page, e.g.
-    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
-    title =
-      document
-      |> Floki.find("article.blog_post h1:first-child")
-      |> Floki.text()
-
-    author =
-      document
-      |> Floki.find("article.blog_post p.subheading")
-      |> Floki.text(deep: false, sep: "")
-      |> String.trim_leading()
-      |> String.trim_trailing()
-
-    time =
-      document
-      |> Floki.find("article.blog_post p.subheading time")
-      |> Floki.text()
-
-    url = response.request_url
-
-    %Crawly.ParsedItem{
-      :requests => requests,
-      :items => [%{title: title, author: author, time: time, url: url}]
-    }
-  end
-
-  def build_absolute_url(url, request_url) do
-    URI.merge(request_url, url) |> to_string()
-  end
-end
-```
-
-Put this code into your project and run it using the Crawly REST API:
-`curl -v localhost:4001/spiders/Esl/schedule`
-
-When it finishes you will get the ESL.jl file stored on your filesystem containing the following information about blog posts:
-
-```json
-{"url":"https://www.erlang-solutions.com/blog/erlang-trace-files-in-wireshark.html","title":"Erlang trace files in Wireshark","time":"2018-06-07","author":"by Magnus Henoch"}
-{"url":"https://www.erlang-solutions.com/blog/railway-oriented-development-with-erlang.html","title":"Railway oriented development with Erlang","time":"2018-06-13","author":"by Oleg Tarasenko"}
-{"url":"https://www.erlang-solutions.com/blog/scaling-reliably-during-the-world-s-biggest-sports-events.html","title":"Scaling reliably during the World’s biggest sports events","time":"2018-06-21","author":"by Erlang Solutions"}
-{"url":"https://www.erlang-solutions.com/blog/escalus-4-0-0-faster-and-more-extensive-xmpp-testing.html","title":"Escalus 4.0.0: faster and more extensive XMPP testing","time":"2018-05-22","author":"by Konrad Zemek"}
-{"url":"https://www.erlang-solutions.com/blog/mongooseim-3-1-inbox-got-better-testing-got-easier.html","title":"MongooseIM 3.1 - Inbox got better, testing got easier","time":"2018-07-25","author":"by Piotr Nosek"}
-....
-```
-
-## What just happened?
-
-When you ran the curl command:
-`curl -v localhost:4001/spiders/Esl/schedule`
-
-Crawly runs a spider ESL, Crawly looked for a Spider definition inside it and ran it through its crawler engine.
-
-The crawl started by making requests to the URLs defined in the start_urls attribute of the spider's init, and called the default callback method `parse_item`, passing the response object as an argument.
-
-In the parse callback, we loop:
-
-1. Look through all pagination the elements using a Floki Selector and extract absolute URLs to follow. URLS are converted into Requests, using `Crawly.Utils.request_from_url()` function
-2. Extract item(s) (items are defined in separate modules, and this part
-   will be covered later on)
-3. Return a Crawly.ParsedItem structure which is containing new requests to follow and items extracted from the given page, all following requests are going to be processed by the same `parse_item` function.
-
-Crawly is fully asynchronous. Once the requests are scheduled, they
-are picked up by separate workers and are executed in parallel. This
-also means that other requests can keep going even if some request
-fails or an error happens while handling it.
-
-While this enables you to do very fast crawls (sending multiple concurrent requests at the same time, in a fault-tolerant way) Crawly also gives you control over the politeness of the crawl through a few settings. You can do things like setting a download delay between each request, limiting the amount of concurrent requests per domain or respecting robots.txt rules
-
-```
-This is using JSON export to generate the JSON lines file, but you can easily extend it to change the export format (XML or CSV, for example).
-
-```
-
-## What else?
-
-You’ve seen how to extract and store items from a website using Crawly, but this is just a basic example. Crawly provides a lot of powerful features for making scraping easy and efficient, such as:
-
-1. Flexible request spoofing (for example user-agents rotation, cookies management (this feature is planned.))
-2. Items validation, using pipelines approach.
-3. Filtering already seen requests and items.
-4. Filter out all requests which targeted at other domains.
-5. Robots.txt enforcement.
-6. Concurrency control.
-7. HTTP API for controlling crawlers.
-8. Interactive console, which allows you to create and debug spiders more easily.
diff --git a/documentation/quickstart.md b/documentation/quickstart.md
deleted file mode 100644
index 140969db..00000000
--- a/documentation/quickstart.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# Quickstart
-
----
-
-Goals:
-
-- Scrape the Erlang Solutions blog for articles, and scrape the article titles.
-- Perform pagination to see more blog posts.
-
-1. Add Crawly as a dependencies:
-   ```elixir
-   # mix.exs
-   defp deps do
-     [
-       {:crawly, "~> 0.8.0"},
-       {:floki, "~> 0.26.0"}
-     ]
-   end
-   ```
-   > **Note**: [`:floki`](https://github.com/philss/floki) is used to illustrate data extraction. Crawly is unopinionated in the way you extract data. You may alternatively use [`:meeseeks`](https://github.com/mischov/meeseeks)
-2. Fetch dependencies: `$ mix deps.get`
-3. Create a spider
-
-   ```elixir
-   # lib/crawly_example/esl_spider.ex
-   defmodule EslSpider do
-     use Crawly.Spider
-     alias Crawly.Utils
-
-     @impl Crawly.Spider
-     def base_url(), do: "https://www.erlang-solutions.com"
-
-     @impl Crawly.Spider
-     def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
-
-     @impl Crawly.Spider
-     def parse_item(response) do
-       {:ok, document} = Floki.parse_document(response.body)
-       hrefs = document |> Floki.find("a.more") |> Floki.attribute("href")
-
-       requests =
-         Utils.build_absolute_urls(hrefs, base_url())
-         |> Utils.requests_from_urls()
-
-       title = document |> Floki.find("article.blog_post h1") |> Floki.text()
-
-       %{
-         :requests => requests,
-         :items => [%{title: title, url: response.request_url}]
-       }
-     end
-   end
-   ```
-
-4. Configure Crawly
-   - By default, Crawly does not require any configuration. But obviously you will need a configuration for fine tuning the crawls:
-   ```elixir
-   # in config.exs
-   config :crawly,
-     closespider_timeout: 10,
-     concurrent_requests_per_domain: 8,
-     middlewares: [
-       Crawly.Middlewares.DomainFilter,
-       {Crawly.Middlewares.RequestOptions, [timeout: 30_000]},
-       Crawly.Middlewares.UniqueRequest,
-       Crawly.Middlewares.UserAgent
-     ],
-     pipelines: [
-       {Crawly.Pipelines.Validate, fields: [:title, :url]},
-       {Crawly.Pipelines.DuplicatesFilter, item_id: :title },
-       {Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
-       {Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp" }
-     ]
-   ```
-5. Start the Crawl:
-   - `$ iex -S mix`
-   - `iex(1)> Crawly.Engine.start_spider(EslSpider)`
-6. Results can be seen with: `$ cat /tmp/EslSpider.csv`
diff --git a/lib/crawly/manager.ex b/lib/crawly/manager.ex
index a414825c..04bfed49 100644
--- a/lib/crawly/manager.ex
+++ b/lib/crawly/manager.ex
@@ -111,7 +111,7 @@ defmodule Crawly.Manager do

     maybe_stop_spider_by_timeout(
       state.name,
-      items_count,
+      delta,
       closespider_timeout_limit
     )

diff --git a/mix.exs b/mix.exs
index 078af8d0..385760b3 100644
--- a/mix.exs
+++ b/mix.exs
@@ -101,13 +101,12 @@ defmodule Crawly.Mixfile do

   defp extras do
     [
-      "documentation/introduction.md",
-      "documentation/quickstart.md",
       "documentation/tutorial.md",
       "documentation/basic_concepts.md",
       "documentation/configuration.md",
       "documentation/http_api.md",
-      "documentation/ethical_aspects.md"
+      "documentation/ethical_aspects.md",
+      "README.md": [title: "Introduction", file: "README.md"]
     ]
   end
 end
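
Note on the `lib/crawly/manager.ex` hunk: the call now receives `delta` (items stored since the previous tick) instead of the cumulative `items_count`, so the `closespider_timeout` check reacts to the crawl stalling rather than to the running total. The sketch below only illustrates that idea; the module and function names are hypothetical and not Crawly's actual internals.

```elixir
defmodule CloseSpiderByTimeoutSketch do
  @moduledoc """
  Illustrative only: decide whether a spider should be stopped because it is
  no longer scraping items fast enough. All names here are hypothetical.
  """

  @doc """
  prev_count  - number of items stored at the previous tick
  items_count - number of items stored now
  limit       - minimum number of new items per tick to keep the spider alive
  """
  def action(prev_count, items_count, limit) do
    # The per-tick delta is what matters: a crawl can have a large cumulative
    # items_count and still be stalled right now.
    delta = items_count - prev_count

    if delta < limit, do: :stop_spider, else: :continue
  end
end

# A crawl with 5_000 items in total but only 2 new items this tick is stopped
# when the limit is 10 items per tick:
CloseSpiderByTimeoutSketch.action(4_998, 5_000, 10)
#=> :stop_spider
```

With the cumulative count, a long-running crawl that had already passed the limit could stall indefinitely without ever triggering the timeout, which appears to be what this one-line change addresses.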