Standalone crawly (#244)
* Standalone

* Standalone Crawly implementation

Allows running Crawly and spiders without installing Elixir or creating an Elixir project.

1. Create Crawly release
2. Load spiders from SPIDERS_DIR
3. Configure Crawly via crawly.config
4. Allow force-reloading the spiders list after new spiders are added
oltarasenko committed Mar 24, 2023
1 parent e736aa8 commit 5eeeb2a
Showing 16 changed files with 295 additions and 5 deletions.
61 changes: 61 additions & 0 deletions Dockerfile
@@ -0,0 +1,61 @@
# ===================== base =====================
FROM elixir:alpine as build

# install build dependencies
RUN apk add --update git make gcc libc-dev autoconf libtool automake

# set build dir
WORKDIR /app

# install hex + rebar
RUN mix local.hex --force && \
mix local.rebar --force

ENV MIX_ENV=standalone_crawly

# install mix dependencies
COPY mix.exs mix.lock /app/
COPY priv /app/priv/
COPY rel /app/rel

RUN mix deps.get
RUN mix deps.compile

# build project code
COPY config/config.exs config/
COPY config/crawly.config config/
COPY config/standalone_crawly.exs config/

# Create default config file
# COPY config/app.config /app/config/app.config

# COPY config/runtime.exs config/
COPY lib lib

RUN mix compile

COPY rel rel

## build release
RUN mix release

# =================== release ====================
FROM alpine:latest AS release

RUN apk add --update openssl make gcc libc-dev autoconf libtool automake

WORKDIR /app

RUN apk add --update bash
COPY --from=build /app/_build/standalone_crawly/rel/crawly ./
COPY --from=build /app/config /app/config

RUN mkdir /app/spiders

EXPOSE 4001

ENTRYPOINT [ "/app/bin/crawly", "start_iex" ]
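To produce the image used in the quickstart below, a plain `docker build` from the repository root should suffice (a sketch; the `crawly:latest` tag is the one assumed by the README's `docker run` example):

``` bash
docker build -t crawly:latest .
```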
50 changes: 49 additions & 1 deletion README.md
@@ -133,6 +133,54 @@ historical archival.
$ cat /tmp/BooksToScrape_<timestamp>.jl
```

## Running Crawly as a standalone application

It's possible to run Crawly as a standalone application, for cases when you just need the data and don't want to install Elixir and other dependencies.

Follow these steps to bootstrap it with Docker:

1. Make a project folder on your filesystem: `mkdir standalone_quickstart`
2. Create a spider inside the folder created in step 1, ideally in a subfolder called `spiders`. For this example we will re-use the quickstart books spider (a minimal sketch is shown below): https://github.com/elixir-crawly/crawly/blob/8926f41df3ddb1a84099543293ec3345b01e2ba5/examples/quickstart/lib/quickstart/books_spider.ex
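A minimal sketch of such a spider file (adapted from the quickstart books spider; the module name and CSS selectors are illustrative, not part of this commit):

``` elixir
defmodule BooksSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://books.toscrape.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Floki ships with the standalone release (see mix.exs below)
    {:ok, document} = Floki.parse_document(response.body)

    # One item per book card; fields match the Validate pipeline config
    items =
      document
      |> Floki.find("article.product_pod")
      |> Enum.map(fn product ->
        %{
          title: product |> Floki.find("h3 a") |> Floki.attribute("title") |> List.first(),
          price: product |> Floki.find(".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    # Follow pagination links
    requests =
      document
      |> Floki.find("li.next a")
      |> Floki.attribute("href")
      |> Enum.map(fn href ->
        href
        |> Crawly.Utils.build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end
```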

3. Create a configuration file (in the Erlang configuration file format), for example:
``` erlang
[{crawly, [
{closespider_itemcount, 500},
{closespider_timeout, 20},
{concurrent_requests_per_domain, 2},

{middlewares, [
'Elixir.Crawly.Middlewares.DomainFilter',
'Elixir.Crawly.Middlewares.UniqueRequest',
'Elixir.Crawly.Middlewares.RobotsTxt',
{'Elixir.Crawly.Middlewares.UserAgent', [
{user_agents, [
<<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
<<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
]
}]
}
]
},

{pipelines, [
{'Elixir.Crawly.Pipelines.Validate', [{fields, [title, price, url]}]},
{'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
'Elixir.Crawly.Pipelines.Experimental.Preview',
{'Elixir.Crawly.Pipelines.JSONEncoder'}
]
}]
}].
```

**TODO**: it would be nice to switch this to a human-readable format, e.g. YAML.

4. Now it's time to run the Docker container:
``` bash
docker run -e "SPIDERS_DIR=/app/spiders" -it -p 4001:4001 -v $(pwd)/spiders:/app/spiders -v $(pwd)/crawly.config:/app/config/crawly.config crawly:latest
```
5. Now you can open the [Management Interface](#management-ui) at localhost:4001 and manage your spiders from there (see the reload sketch below).
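After dropping new spider files into the mounted `spiders` folder you can force a reload without restarting the container, either via the "Reload spiders" button in the UI or via the HTTP endpoint added in this commit (a sketch, assuming the port mapping from the `docker run` above; the exact response text is illustrative):

``` bash
curl localhost:4001/load-spiders
# => Loaded the following spiders from $SPIDERS_DIR: [BooksSpider]
```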

## Need more help?

Please use discussions for all conversations related to the project.
@@ -146,7 +194,7 @@ of asynchronous elements (for example parts loaded by AJAX).
You can read more here:
- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)

## Simple management UI (New in 0.15.0) {#management-ui}
Crawly provides a simple management UI by default on the `localhost:4001`

It allows you to:
1 change: 1 addition & 0 deletions config/crawly.config
@@ -0,0 +1 @@
[].
38 changes: 38 additions & 0 deletions config/standalone_crawly.exs
@@ -0,0 +1,38 @@
# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
import Config

config :logger, :console, truncate: :infinity

config :crawly,
fetcher: {Crawly.Fetchers.HTTPoisonFetcher, []},
retry: [
retry_codes: [400],
max_retries: 3,
ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
],

# Stop spider after scraping a certain number of items
closespider_itemcount: 500,
# Stop spider if it does not crawl fast enough
closespider_timeout: 20,
concurrent_requests_per_domain: 5,

# Request middlewares
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
{Crawly.Middlewares.UserAgent,
user_agents: [
"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41"
]}
],
pipelines: [
{Crawly.Pipelines.Validate, fields: [:title, :price, :url]},
{Crawly.Pipelines.DuplicatesFilter, item_id: :title},
{Crawly.Pipelines.Experimental.Preview, limit: 100},
Crawly.Pipelines.JSONEncoder
]
10 changes: 10 additions & 0 deletions lib/crawly.ex
@@ -3,6 +3,8 @@ defmodule Crawly do
Crawly is a fast high-level web crawling & scraping framework for Elixir.
"""

require Logger

@doc """
Fetches a given url. This function is mainly used for the spiders development
when you need to get individual pages and parse them.
@@ -128,4 +130,12 @@
"""
@spec list_spiders() :: [module()]
def list_spiders(), do: Crawly.Utils.list_spiders()

@doc """
Loads spiders from a given directory. Stores them in persistent_term under :crawly_spiders.
"""
@spec load_spiders() :: {:ok, [module()]} | {:error, :no_spiders_dir}
def load_spiders() do
Crawly.Utils.load_spiders()
end
end
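Since the container's entrypoint is `start_iex`, the new function can also be called interactively. A sketch (the returned module list depends on the contents of SPIDERS_DIR; `BooksSpider` is the hypothetical spider from the README example):

``` elixir
# Assumes SPIDERS_DIR is set, e.g. SPIDERS_DIR=/app/spiders
{:ok, spiders} = Crawly.load_spiders()
# => {:ok, [BooksSpider]}

# Spiders loaded this way can be started as usual
Crawly.Engine.start_spider(BooksSpider)
```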
14 changes: 14 additions & 0 deletions lib/crawly/api.ex
@@ -188,6 +188,20 @@ defmodule Crawly.API.Router do
send_resp(conn, 200, msg)
end

get "/load-spiders" do
loaded_spiders =
case Crawly.load_spiders() do
{:ok, spiders} -> spiders
{:error, _} -> []
end

send_resp(
conn,
200,
"Loaded following spiders from $SPIDERS_DIR: #{inspect(loaded_spiders)}"
)
end

match _ do
send_resp(conn, 404, "Oops! Page not found!")
end
3 changes: 3 additions & 0 deletions lib/crawly/application.ex
@@ -6,6 +6,9 @@ defmodule Crawly.Application do
use Application

def start(_type, _args) do
# Try to load spiders from the SPIDERS_DIR (for crawly standalone setup)
Crawly.load_spiders()

import Supervisor.Spec, warn: false
# List all child processes to be supervised

49 changes: 47 additions & 2 deletions lib/crawly/utils.ex
@@ -148,11 +148,17 @@ defmodule Crawly.Utils do

@doc """
Returns a list of known modules which implement the Crawly.Spider behaviour.
Also searches for module names under the :crawly_spiders key in persistent_term.
"""
@spec list_spiders() :: [module()]
def list_spiders() do
  modules =
    get_modules_from_applications() ++
      :persistent_term.get(:crawly_spiders, [])

  Enum.reduce(
    modules,
    [],
fn mod, acc ->
try do
@@ -180,6 +186,45 @@
)
end

@doc """
Loads spiders from a given directory. Stores them in persistent_term under
:crawly_spiders. This allows loading spiders from a specific directory which
is not a part of the Crawly application.
"""
@spec load_spiders() :: {:ok, [module()]} | {:error, :no_spiders_dir}
def load_spiders() do
case System.get_env("SPIDERS_DIR", nil) do
nil ->
Logger.error("""
SPIDERS_DIR environment variable needs to be set in order to load
spiders dynamically
""")

{:error, :no_spiders_dir}

dir ->
{:ok, files} = File.ls(dir)

# Remove all previous spiders data from the persistent_term storage
:persistent_term.put(:crawly_spiders, [])

Enum.each(
files,
fn file ->
path = Path.join(dir, file)
[{module, _binary}] = Code.compile_file(path)

# Use persistent term to store information about loaded spiders
spiders = :persistent_term.get(:crawly_spiders, [])
:persistent_term.put(:crawly_spiders, [module | spiders])
end
)
end

{:ok, :persistent_term.get(:crawly_spiders, [])}
end

##############################################################################
# Private functions
##############################################################################
3 changes: 3 additions & 0 deletions mix.exs
@@ -56,6 +56,9 @@ defmodule Crawly.Mixfile do
{:earmark, "~> 1.2", only: :dev},
{:meck, "~> 0.9", only: :test},
{:excoveralls, "~> 0.14.6", only: :test},

# Add floki only for crawly standalone release
{:floki, "~> 0.33.0", only: :standalone_crawly},
{:logger_file_backend, "~> 0.0.11", only: [:test, :dev]}
]
end
2 changes: 2 additions & 0 deletions mix.lock
@@ -11,8 +11,10 @@
"ex_doc": {:hex, :ex_doc, "0.25.3", "3edf6a0d70a39d2eafde030b8895501b1c93692effcbd21347296c18e47618ce", [:mix], [{:earmark_parser, "~> 1.4.0", [hex: :earmark_parser, repo: "hexpm", optional: false]}, {:makeup_elixir, "~> 0.14", [hex: :makeup_elixir, repo: "hexpm", optional: false]}, {:makeup_erlang, "~> 0.1", [hex: :makeup_erlang, repo: "hexpm", optional: false]}], "hexpm", "9ebebc2169ec732a38e9e779fd0418c9189b3ca93f4a676c961be6c1527913f5"},
"excoveralls": {:hex, :excoveralls, "0.14.6", "610e921e25b180a8538229ef547957f7e04bd3d3e9a55c7c5b7d24354abbba70", [:mix], [{:hackney, "~> 1.16", [hex: :hackney, repo: "hexpm", optional: false]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm", "0eceddaa9785cfcefbf3cd37812705f9d8ad34a758e513bb975b081dce4eb11e"},
"file_system": {:hex, :file_system, "0.2.10", "fb082005a9cd1711c05b5248710f8826b02d7d1784e7c3451f9c1231d4fc162d", [:mix], [], "hexpm", "41195edbfb562a593726eda3b3e8b103a309b733ad25f3d642ba49696bf715dc"},
"floki": {:hex, :floki, "0.33.1", "f20f1eb471e726342b45ccb68edb9486729e7df94da403936ea94a794f072781", [:mix], [{:html_entities, "~> 0.5.0", [hex: :html_entities, repo: "hexpm", optional: false]}], "hexpm", "461035fd125f13fdf30f243c85a0b1e50afbec876cbf1ceefe6fddd2e6d712c6"},
"gollum": {:hex, :new_gollum, "0.4.0", "89e3e2fc5abd032455341c4a03bcef7042b8d08e02c51df24b99a1a0a1ad69b1", [:mix], [{:httpoison, "~> 1.7", [hex: :httpoison, repo: "hexpm", optional: false]}], "hexpm", "85c68465e8678637638656945677062a4e7086e91a04d5c4bca1027321c74582"},
"hackney": {:hex, :hackney, "1.18.1", "f48bf88f521f2a229fc7bae88cf4f85adc9cd9bcf23b5dc8eb6a1788c662c4f6", [:rebar3], [{:certifi, "~>2.9.0", [hex: :certifi, repo: "hexpm", optional: false]}, {:idna, "~>6.1.0", [hex: :idna, repo: "hexpm", optional: false]}, {:metrics, "~>1.0.0", [hex: :metrics, repo: "hexpm", optional: false]}, {:mimerl, "~>1.1", [hex: :mimerl, repo: "hexpm", optional: false]}, {:parse_trans, "3.3.1", [hex: :parse_trans, repo: "hexpm", optional: false]}, {:ssl_verify_fun, "~>1.1.0", [hex: :ssl_verify_fun, repo: "hexpm", optional: false]}, {:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "a4ecdaff44297e9b5894ae499e9a070ea1888c84afdd1fd9b7b2bc384950128e"},
"html_entities": {:hex, :html_entities, "0.5.2", "9e47e70598da7de2a9ff6af8758399251db6dbb7eebe2b013f2bbd2515895c3c", [:mix], [], "hexpm", "c53ba390403485615623b9531e97696f076ed415e8d8058b1dbaa28181f4fdcc"},
"httpoison": {:hex, :httpoison, "1.8.0", "6b85dea15820b7804ef607ff78406ab449dd78bed923a49c7160e1886e987a3d", [:mix], [{:hackney, "~> 1.17", [hex: :hackney, repo: "hexpm", optional: false]}], "hexpm", "28089eaa98cf90c66265b6b5ad87c59a3729bea2e74e9d08f9b51eb9729b3c3a"},
"idna": {:hex, :idna, "6.1.1", "8a63070e9f7d0c62eb9d9fcb360a7de382448200fbbd1b106cc96d3d8099df8d", [:rebar3], [{:unicode_util_compat, "~>0.7.0", [hex: :unicode_util_compat, repo: "hexpm", optional: false]}], "hexpm", "92376eb7894412ed19ac475e4a86f7b413c1b9fbb5bd16dccd57934157944cea"},
"jason": {:hex, :jason, "1.4.0", "e855647bc964a44e2f67df589ccf49105ae039d4179db7f6271dfd3843dc27e6", [:mix], [{:decimal, "~> 1.0 or ~> 2.0", [hex: :decimal, repo: "hexpm", optional: true]}], "hexpm", "79a3791085b2a0f743ca04cec0f7be26443738779d09302e01318f97bdb82121"},
3 changes: 3 additions & 0 deletions priv/list.html.eex
@@ -3,6 +3,7 @@
<div class="leftcolumn">
<div class="card">
<h2>Spiders</h2>

<table>
<tr>
<th>Spider name</th>
@@ -26,6 +27,8 @@
</tr>
<% end %>
</table>
<br />
<input type = "button" onclick = "get('Reload', '/load-spiders')" value = "Reload spiders">
</div>
</div>
<div class="rightcolumn">
8 changes: 8 additions & 0 deletions rel/env.bat.eex
@@ -0,0 +1,8 @@
@echo off
rem Set the release to load code on demand (interactive) instead of preloading (embedded).
rem set RELEASE_MODE=interactive

rem Set the release to work across nodes.
rem RELEASE_DISTRIBUTION must be "sname" (local), "name" (distributed) or "none".
rem set RELEASE_DISTRIBUTION=name
rem set RELEASE_NODE=<%= @release.name %>
20 changes: 20 additions & 0 deletions rel/env.sh.eex
@@ -0,0 +1,20 @@
#!/bin/sh

# # Sets and enables heart (recommended only in daemon mode)
# case $RELEASE_COMMAND in
# daemon*)
# HEART_COMMAND="$RELEASE_ROOT/bin/$RELEASE_NAME $RELEASE_COMMAND"
# export HEART_COMMAND
# export ELIXIR_ERL_OPTIONS="-heart"
# ;;
# *)
# ;;
# esac

# # Set the release to load code on demand (interactive) instead of preloading (embedded).
# export RELEASE_MODE=interactive

# # Set the release to work across nodes.
# # RELEASE_DISTRIBUTION must be "sname" (local), "name" (distributed) or "none".
# export RELEASE_DISTRIBUTION=name
# export RELEASE_NODE=<%= @release.name %>
8 changes: 8 additions & 0 deletions rel/remote.vm.args.eex
@@ -0,0 +1,8 @@
## Customize flags given to the VM: https://www.erlang.org/doc/man/erl.html
## -mode/-name/-sname/-setcookie are configured via env vars, do not set them here

## Increase number of concurrent ports/sockets
##+Q 65536

## Tweak GC to run more often
##-env ERL_FULLSWEEP_AFTER 10
10 changes: 10 additions & 0 deletions rel/vm.args.eex
@@ -0,0 +1,10 @@
## Customize flags given to the VM: https://www.erlang.org/doc/man/erl.html
## -mode/-name/-sname/-setcookie are configured via env vars, do not set them here

## Increase number of concurrent ports/sockets
##+Q 65536

## Tweak GC to run more often
##-env ERL_FULLSWEEP_AFTER 10

-config /app/config/crawly.config
