
Commit

Merge 2decb2c into e04dbe6
oltarasenko committed Dec 10, 2019
2 parents e04dbe6 + 2decb2c commit b54f578
Showing 11 changed files with 1,084 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -1,4 +1,4 @@
# Crawly into
# Crawly intro
---

Crawly is an application framework for crawling web sites and
Binary file added documentation/assets/logo.png
154 changes: 154 additions & 0 deletions documentation/basic_concepts.md
@@ -0,0 +1,154 @@
# Basic concepts
---

## Spiders

Spiders are modules which define how a certain site (or a group of
sites) will be scraped, including how to perform the crawl
(i.e. follow links) and how to extract structured data from their
pages (i.e. scraping items). In other words, Spiders are the place
where you define the custom behaviour for crawling and parsing pages
for a particular site.

For spiders, the scraping cycle goes through something like this:

You start by generating the initial requests to crawl the first URLs;
a callback function is then called with the response downloaded from
each of those requests.

In the callback function, you parse the response (web page) and return
a `%Crawly.ParsedItem{}` struct. This struct should contain new
requests to follow and items to be stored.

In the callback functions, you parse the page contents, typically using
Floki (but you can also use any other library you prefer) and generate
items with the parsed data.

Spiders are executed in the context of Crawly.Worker processes, and
you can control the number of concurrent workers via the
`concurrent_requests_per_domain` setting.

All requests are processed sequentially and are pre-processed by
middlewares.

All items are processed sequentially by item pipelines.

### Behaviour functions

In order to make a working web crawler, all the behaviour callbacks need
to be implemented.

`init()` - a part of the Crawly.Spider behaviour. This function should
return a keyword list containing a `start_urls` entry, a list of URLs which
defines the starting requests made by Crawly.

`base_url()` - defines the base URL of the given spider. This function
is used to filter out all requests leading outside of the crawled
website.

`parse_item(response)` - a function which defines how a given response
is translated into the `Crawly.ParsedItem` structure. At a high
level, this function defines the extraction rules for both items and requests.
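
For illustration, a minimal spider implementing these callbacks might look like the sketch below (the domain, selectors and item fields are hypothetical placeholders):

```elixir
defmodule MySpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/news.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Follow pagination links found on the page
    # (placeholder selector; assumes the hrefs are absolute URLs).
    requests =
      response.body
      |> Floki.find("a.next")
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    # Extract one item from the page (placeholder fields).
    item = %{
      title: response.body |> Floki.find("h1") |> Floki.text(),
      url: response.request_url
    }

    %Crawly.ParsedItem{requests: requests, items: [item]}
  end
end
```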

## Requests and Responses

Crawly uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed
across the system until they reach the Crawly.Worker process, which
executes the request and returns a Response object that travels back
to the spider that issued the request. Request objects are modified by
the selected middlewares before hitting the worker.

The request is defined as the following structure:
``` elixir
@type t :: %Crawly.Request{
  url: binary(),
  headers: [header()],
  prev_response: %{},
  options: [option()]
}

@type header() :: {key(), value()}
```

Where:
1. url - the URL of the request
2. headers - the HTTP headers which are going to be used with the
   given request
3. options - request options (for example, whether to follow redirects).
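
For illustration, a request for a given URL can be built with the `Crawly.Utils.request_from_url()` helper (used in the introduction example) and then adjusted; the header below is only a placeholder:

```elixir
# Build a request from a URL; middlewares may later adjust headers and options.
request = Crawly.Utils.request_from_url("https://www.erlang-solutions.com/blog.html")

# Add an HTTP header to the request (placeholder header value).
request = %Crawly.Request{request | headers: [{"Accept", "text/html"} | request.headers]}
```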

Crawly uses the HTTPoison library to perform requests, but we have
plans to extend the support with other pluggable backends, like
Selenium and others.

Responses are defined in the same way as HTTPoison responses. See more
details here: https://hexdocs.pm/httpoison/HTTPoison.Response.html#content

## Parsed Item

ParsedItem is a structure which is filled by the `parse_item/1`
callback of the Spider. The structure is defined in the following way:

```elixir
@type item() :: %{}
@type t :: %__MODULE__{
  items: [item()],
  requests: [Crawly.Request.t()]
}

```
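
For example, a `parse_item/1` implementation that found two more pages to visit and extracted one item might return something like this (a sketch with placeholder values):

```elixir
%Crawly.ParsedItem{
  requests: [
    Crawly.Utils.request_from_url("https://example.com/blog/page/2"),
    Crawly.Utils.request_from_url("https://example.com/blog/page/3")
  ],
  items: [%{title: "Example post", url: "https://example.com/blog/example-post"}]
}
```
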
The parsed item is processed by the Crawly.Worker process, which
sends all requests to the `Crawly.RequestsStorage` process,
responsible for pre-processing requests and storing them for
future execution. All items are sent to the `Crawly.DataStorage`
process, which is responsible for pre-processing items and storing them
on disk.

For now, only one storage backend is supported (writing to disk), but
in the future Crawly will also support backends such as Amazon S3, SQL databases and others.

## Request Middlewares

Crawly uses the concept of pipelines for processing the elements
sent through the system. In this section we will cover the
topic of request middlewares - a powerful tool which allows you to modify
a request before it is sent to the target website. In most cases
spider developers will want to modify request headers, which makes
requests look more natural to the crawled websites.

At this point Crawly includes the following request middlewares:
1. `Crawly.Middlewares.DomainFilter` - this middleware disables
   scheduling for all requests leading outside of the crawled
   site. The middleware uses `base_url()` defined in the
   `Crawly.Spider` behaviour in order to do its job.
2. `Crawly.Middlewares.RobotsTxt` - this middleware ensures that
   Crawly respects the robots.txt defined by the target website.
3. `Crawly.Middlewares.UniqueRequest` - this middleware ensures that
   Crawly does not schedule the same URL (request) multiple times.
4. `Crawly.Middlewares.UserAgent` - this middleware is used to set the
   User-Agent HTTP header. It allows rotating user agents if they are
   defined as a list.

The list of request middlewares used with a given project is defined
in the project settings.
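
Such a configuration might look roughly like this in `config/config.exs` (a sketch; the exact option name used by your Crawly version may differ):

```elixir
use Mix.Config

config :crawly,
  # The :middlewares key is assumed here; check the settings reference for the exact name.
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    Crawly.Middlewares.UserAgent
  ]
```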

## Item pipelines

Crawly uses the concept of pipelines for processing the elements
sent through the system. In this section we will cover the
topic of item pipelines - a tool used to pre-process
items before storing them in the storage.

At this point Crawly includes the following item pipelines:
1. `Crawly.Pipelines.Validate` - validates that a given item has all
   the required fields. All items which don't have all required fields
   are dropped.
2. `Crawly.Pipelines.DuplicatesFilter` - filters out items which are
   already stored in the system.
3. `Crawly.Pipelines.JSONEncoder` - converts items into JSON format.
4. `Crawly.Pipelines.CSVEncoder` - converts items into CSV format.
5. `Crawly.Pipelines.WriteToFile` - writes information to a given file.

The list of item pipelines used with a given project is defined in the
project settings.
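
As with the middlewares, the pipelines can be listed in the configuration (a sketch; the exact option name may differ in your Crawly version):

```elixir
use Mix.Config

config :crawly,
  # The :pipelines key is assumed here; check the settings reference for the exact name.
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.JSONEncoder
  ]
```
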
12 changes: 12 additions & 0 deletions documentation/ethical_aspects.md
@@ -0,0 +1,12 @@
# Ethical aspects of crawling
---

It's important to be polite when doing web crawling. You should
avoid cases where your spiders harm the scraped
websites. As mentioned here: https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy#comments-listing

1. A polite crawler respects robots.txt.
2. A polite crawler never degrades a website’s performance.
3. A polite crawler identifies its creator with contact information.
4. A polite crawler is not a pain in the buttocks of system
administrators.
34 changes: 34 additions & 0 deletions documentation/http_api.md
@@ -0,0 +1,34 @@
# HTTP API
---

Crawly supports a basic HTTP API, which allows you to control the
engine's behaviour.

## Starting a spider

The following command will start a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/schedule
```

## Stopping a spider

The following command will stop a given Crawly spider:

```
curl -v localhost:4001/spiders/<spider_name>/stop
```

## Getting currently running spiders

```
curl -v localhost:4001/spiders
```

## Getting spider stats

```
curl -v localhost:4001/spiders/<spider_name>/scheduled-requests
curl -v localhost:4001/spiders/<spider_name>/scraped-items
```
14 changes: 14 additions & 0 deletions documentation/installation_guide.md
@@ -0,0 +1,14 @@
# Installation guide
---

Crawly requires Elixir v1.7 or higher. To create a Crawly
project, execute the following steps:

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your mix.exs file
```elixir
def deps do
[{:crawly, "~> 0.6.0"}]
end
```
3. Fetch crawly: `mix deps.get`
152 changes: 152 additions & 0 deletions documentation/introduction.md
@@ -0,0 +1,152 @@
# Crawly intro
---

Crawly is an application framework for crawling web sites and
extracting structured data which can be used for a wide range of
useful applications, like data mining, information processing or
historical archival.

## Walk-through of an example spider

In order to show you what Crawly brings to the table, we’ll walk you
through an example of a Crawly spider using the simplest way to run a spider.

Here’s the code for a spider that scrapes blog posts from the Erlang
Solutions blog: https://www.erlang-solutions.com/blog.html,
following the pagination:

```elixir
defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Getting new urls to follow
    urls =
      response.body
      |> Floki.find("a.more")
      |> Floki.attribute("href")
      |> Enum.uniq()

    # Convert URLs into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    title =
      response.body
      |> Floki.find("article.blog_post h1:first-child")
      |> Floki.text()

    author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

    time =
      response.body
      |> Floki.find("article.blog_post p.subheading time")
      |> Floki.text()

    url = response.request_url

    %Crawly.ParsedItem{
      requests: requests,
      items: [%{title: title, author: author, time: time, url: url}]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end
```

Put this code into your project and run it using the Crawly REST API:
`curl -v localhost:4001/spiders/Esl/schedule`

When it finishes, you will get the ESL.jl file stored on your
filesystem, containing the following information about the blog posts:

```json
{"url":"https://www.erlang-solutions.com/blog/erlang-trace-files-in-wireshark.html","title":"Erlang trace files in Wireshark","time":"2018-06-07","author":"by Magnus Henoch"}
{"url":"https://www.erlang-solutions.com/blog/railway-oriented-development-with-erlang.html","title":"Railway oriented development with Erlang","time":"2018-06-13","author":"by Oleg Tarasenko"}
{"url":"https://www.erlang-solutions.com/blog/scaling-reliably-during-the-world-s-biggest-sports-events.html","title":"Scaling reliably during the World’s biggest sports events","time":"2018-06-21","author":"by Erlang Solutions"}
{"url":"https://www.erlang-solutions.com/blog/escalus-4-0-0-faster-and-more-extensive-xmpp-testing.html","title":"Escalus 4.0.0: faster and more extensive XMPP testing","time":"2018-05-22","author":"by Konrad Zemek"}
{"url":"https://www.erlang-solutions.com/blog/mongooseim-3-1-inbox-got-better-testing-got-easier.html","title":"MongooseIM 3.1 - Inbox got better, testing got easier","time":"2018-07-25","author":"by Piotr Nosek"}
....
```

## What just happened?

When you ran the curl command:
```curl -v localhost:4001/spiders/Esl/schedule```

Crawly scheduled the Esl spider: it looked up the spider definition
and ran it through its crawler engine.

The crawl started by making requests to the URLs defined in the
start_urls entry of the spider's `init/0`, and called the default
callback function `parse_item`, passing the response object as an
argument. In the parse callback, we:
1. Look through all the pagination elements using a Floki selector and
   extract absolute URLs to follow. URLs are converted into requests
   using the `Crawly.Utils.request_from_url()` function.
2. Extract item(s) (items are defined in separate modules; this part
   will be covered later on).
3. Return a Crawly.ParsedItem structure containing the new requests
   to follow and the items extracted from the given page. All
   following requests are processed by the same `parse_item` function.

Crawly is fully asynchronous. Once the requests are scheduled, they
are picked up by separate workers and are executed in parallel. This
also means that other requests can keep going even if some request
fails or an error happens while handling it.


While this enables you to do very fast crawls (sending multiple
concurrent requests at the same time, in a fault-tolerant way), Crawly
also gives you control over the politeness of the crawl through a few
settings. You can do things like setting a download delay between
requests, limiting the number of concurrent requests per domain, or
respecting robots.txt rules.
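
For example, concurrency per domain can be limited through the project configuration (a minimal sketch; the setting name comes from the basic concepts document, the rest should be checked against the settings reference):

```elixir
use Mix.Config

config :crawly,
  # Limit the number of parallel requests per domain
  # (setting name taken from the basic concepts document).
  concurrent_requests_per_domain: 2

# robots.txt handling is provided by the Crawly.Middlewares.RobotsTxt middleware,
# enabled through the request middlewares list described in the basic concepts document.
```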

> This is using JSON export to generate the JSON lines file, but you can
> easily extend it to change the export format (XML or CSV, for example).

## What else?

You’ve seen how to extract and store items from a website using
Crawly, but this is just a basic example. Crawly provides a lot of
powerful features for making scraping easy and efficient, such as:

1. Flexible request spoofing (for example, user-agent rotation; cookie
   management is planned).
2. Item validation, using a pipelines approach.
3. Filtering already seen requests and items.
4. Filtering out all requests targeted at other domains.
5. Robots.txt enforcement.
6. Concurrency control.
7. HTTP API for controlling crawlers.
8. Interactive console, which allows you to create and debug spiders more easily.
