Commit 6f645cf
Merge bfd2511 into 333d6e8
oltarasenko committed May 8, 2020
2 parents 333d6e8 + bfd2511
Showing 7 changed files with 202 additions and 0 deletions.
14 changes: 14 additions & 0 deletions README.md
@@ -97,6 +97,20 @@ of asynchronous elements (for example parts loaded by AJAX).
You can read more here:
- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)

## New: Experimental UI

We have taken the first steps towards building a UI for the Crawly project,
which simplifies day-to-day spider management routines. It can be useful when
you need to organize scraping at a high level and to assure the quality of the
extracted data.

![](doc/assets/main_page.png?raw=true)
![](doc/assets/items_page.png?raw=true)
![](doc/assets/item_with_filters.png?raw=true)
![](doc/assets/item_preview_example.png?raw=true)

See more at [Experimental UI](https://hexdocs.pm/crawly/experimental_ui.html#content)

## Documentation

- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
78 changes: 78 additions & 0 deletions documentation/experimental_ui.md
@@ -0,0 +1,78 @@
# Experimental UI
---

We believe that web scraping is a process. It might seem easy to extract the
first data items, but we believe that reliable data delivery requires a bit
more effort, and a process which supports it!

Our aim is to provide you with the following services:

1. Schedule (start and stop) your spiders on a cloud
2. View running jobs (performance based analysis)
3. View and validate scraped items for quality assurance and data analysis purposes.
4. View individual items and compare them with the actual website.

## Project status

Currently the project is in an early alpha stage. We're constantly working on
making it more stable, and we are already running it for long-running jobs.
So far it has shown no major problems, but we accept that there will be
problems at such an early stage! If you run into one, please don't hesitate
to report it here.

## Setting it up

You can find setup examples [here](https://github.com/oltarasenko/crawly_ui/tree/master/examples).

At a high level, it's required to:
1. Add the SendToUI pipeline to the list of your item pipelines (before the
   encoder pipelines):
   `{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}`
2. Organize an Erlang cluster so that Crawly nodes can find the CrawlyUI node.
   In the example above we used the
   [erlang-node-discovery](https://github.com/oltarasenko/erlang-node-discovery)
   application for this task, however any other alternative would also work.
   To set up erlang-node-discovery:
   - add the dependency to the `deps` section of `mix.exs`:
     `{:erlang_node_discovery, git: "https://github.com/oltarasenko/erlang-node-discovery"}`
   - add the following lines to `config.exs`:
     ```elixir
     config :erlang_node_discovery,
       hosts: ["127.0.0.1", "crawlyui.com"],
       node_ports: [
         {:ui, 0}
       ]
     ```
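Putting both steps together, a worker node's `config.exs` might look like the
sketch below. The exact pipeline list is an assumption — keep your project's
own pipelines and just add `SendToUI` before the encoder:

```elixir
# Sketch of a Crawly worker node's config.exs. The JSONEncoder and
# WriteToFile entries are illustrative; keep your own pipeline list
# and add SendToUI before the encoder pipelines.
import Config

config :crawly,
  pipelines: [
    # Ship every scraped item to the CrawlyUI node first...
    {Crawly.Pipelines.Experimental.SendToUI, ui_node: :"ui@127.0.0.1"},
    # ...then encode and store it as usual.
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]

config :erlang_node_discovery,
  hosts: ["127.0.0.1", "crawlyui.com"],
  node_ports: [{:ui, 0}]
```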

## Testing it locally with docker-compose

CrawlyUI ships with a docker-compose setup which brings up the UI, worker and
database nodes, so everything is ready for testing with just one command.

To try it:
1. Clone the crawly_ui repo: `git clone git@github.com:oltarasenko/crawly_ui.git`
2. Build the ui and worker nodes: `docker-compose build`
3. Apply migrations: `docker-compose run ui bash -c "/crawlyui/bin/ec eval \"CrawlyUI.ReleaseTasks.migrate\""`
4. Run it all: `docker-compose up`

## Live demo

A live demo is available as well. However, it might be a bit unstable due to
the continuous release process. Please give it a try and let us know what you
think.

[Live Demo](http://18.216.221.122/)

## Items browser

One of the cool features of CrawlyUI is the items browser, which allows
comparing extracted data with the target website loaded in an IFRAME. However,
most big sites block iframes, so the preview will not work for them unless you
install a browser extension which ignores X-Frame headers, for example this
[Chrome extension](https://chrome.google.com/webstore/detail/ignore-x-frame-headers/gleekbfjekiniecknbkamfmkohkpodhe).

## Gallery

![Main Page](assets/main_page.png?raw=true)

![Items browser](assets/items_page.png?raw=true)

![Items browser search](assets/item_with_filters.png?raw=true)

![Item preview](assets/item_preview_example.png?raw=true)
6 changes: 6 additions & 0 deletions lib/crawly.ex
@@ -53,4 +53,10 @@ defmodule Crawly do
spider.parse_item(response)
end
end

@doc """
  Returns a list of known modules which implement the `Crawly.Spider` behaviour.
"""
@spec list_spiders() :: [module()]
def list_spiders(), do: Crawly.Utils.list_spiders()
end
42 changes: 42 additions & 0 deletions lib/crawly/pipelines/experimental/send_to_ui.ex
@@ -0,0 +1,42 @@
defmodule Crawly.Pipelines.Experimental.SendToUI do
  @moduledoc """
  Experimental pipeline which sends each scraped item to a CrawlyUI node
  via `:rpc.cast/4`. Requires the `ui_node` option, for example:
  `{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}`
  """
@behaviour Crawly.Pipeline

require Logger

@impl Crawly.Pipeline
def run(item, state, opts \\ []) do
job_tag =
case Map.get(state, :job_tag, nil) do
nil ->
UUID.uuid1()

tag ->
tag
end

ui_node =
case Keyword.get(opts, :ui_node) do
nil ->
throw(
"No ui node is set. It's required to set a UI node to use " <>
"this pipeline"
)

node ->
node
end

spider_name = state.spider_name |> Atom.to_string()

:rpc.cast(ui_node, CrawlyUI, :store_item, [
spider_name,
item,
job_tag,
Node.self() |> to_string()
])

{item, Map.put(state, :job_tag, job_tag)}
end
end
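The job-tag threading above can be sketched without UUID or a running UI node.
`DemoSendToUI` is a hypothetical stand-in: `make_ref/0` replaces
`UUID.uuid1/0`, and the `:rpc.cast/4` call is stubbed out:

```elixir
# Hypothetical, simplified version of SendToUI.run/3: generate a job
# tag on the first call and reuse it on every subsequent call.
defmodule DemoSendToUI do
  def run(item, state, _opts \\ []) do
    job_tag = Map.get(state, :job_tag) || inspect(make_ref())
    # A real pipeline would :rpc.cast the item to the UI node here.
    {item, Map.put(state, :job_tag, job_tag)}
  end
end

{_item, state} = DemoSendToUI.run(%{title: "Title"}, %{spider_name: "demo"})
{_item, state2} = DemoSendToUI.run(%{title: "Title"}, state)

IO.puts(state2.job_tag == state.job_tag)  # prints "true"
```

Because the tag lives in the pipeline state, re-running the pipeline for the
same job keeps all items grouped under one tag.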
37 changes: 37 additions & 0 deletions lib/crawly/utils.ex
@@ -139,6 +139,31 @@ defmodule Crawly.Utils do
end
end


@doc """
  Returns a list of known modules which implement the `Crawly.Spider` behaviour.
"""
@spec list_spiders() :: [module()]
  def list_spiders() do
    Enum.reduce(
      get_modules_from_applications(),
      [],
      fn mod, acc ->
        try do
          behaviours = Keyword.get(mod.__info__(:attributes), :behaviour, [])

          case Crawly.Spider in behaviours do
            true -> [mod | acc]
            false -> acc
          end
        rescue
          _ -> acc
        end
      end
    )
  end
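The attribute lookup used by `list_spiders/0` can be tried in isolation.
`DemoBehaviour`, `DemoSpider` and `NotASpider` below are hypothetical stand-ins
for `Crawly.Spider` and user modules:

```elixir
# A module's @behaviour declarations are persisted and readable at
# runtime through __info__(:attributes) — the same mechanism
# list_spiders/0 relies on.
defmodule DemoBehaviour do
  @callback parse_item(term()) :: term()
end

defmodule DemoSpider do
  @behaviour DemoBehaviour
  def parse_item(response), do: response
end

defmodule NotASpider do
  def hello, do: :world
end

spiders =
  for mod <- [DemoSpider, NotASpider],
      Keyword.get(mod.__info__(:attributes), :behaviour, []) == [DemoBehaviour],
      do: mod

IO.inspect(spiders)  # prints [DemoSpider]
```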


##############################################################################
# Private functions
##############################################################################
@@ -159,4 +184,16 @@
nil
end
end

@spec get_modules_from_applications() :: [module()]
  defp get_modules_from_applications do
Enum.reduce(Application.started_applications(), [], fn {app, _descr, _vsn}, acc ->
case :application.get_key(app, :modules) do
{:ok, modules} ->
modules ++ acc
_other ->
acc
end
end)
end
end
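The module enumeration in `get_modules_from_applications/0` can be verified
with a plain Elixir script — every started OTP application (including
`:elixir` itself) reports its modules through `:application.get_key/2`:

```elixir
# Collect the modules of all started OTP applications, mirroring
# get_modules_from_applications/0 from Crawly.Utils.
mods =
  Enum.reduce(Application.started_applications(), [], fn {app, _descr, _vsn}, acc ->
    case :application.get_key(app, :modules) do
      {:ok, modules} -> modules ++ acc
      _other -> acc
    end
  end)

# Enum ships with the :elixir application, so it must be in the list.
IO.puts(Enum.member?(mods, Enum))  # prints "true"
```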
1 change: 1 addition & 0 deletions mix.exs
@@ -103,6 +103,7 @@ defmodule Crawly.Mixfile do
"documentation/configuration.md",
"documentation/http_api.md",
"documentation/ethical_aspects.md",
"documentation/experimental_ui.md",
"readme.md": [title: "Introduction", file: "README.md"]
]
end
24 changes: 24 additions & 0 deletions test/pipelines/experimental/send_to_ui_test.exs
@@ -0,0 +1,24 @@
defmodule Pipelines.Experimental.SendToUITest do
use ExUnit.Case, async: false

@item %{title: "Title", author: "Me"}
test "job tag is added to the state" do
pipelines = [{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}]
state = %{spider_name: PipelineTestSpider}
{@item, state} = Crawly.Utils.pipe(pipelines, @item, state)

assert Map.get(state, :job_tag) != nil
end

test "job tag is not re-generated if pipeline was re-executed" do
pipelines = [{Crawly.Pipelines.Experimental.SendToUI, ui_node: :'ui@127.0.0.1'}]
state = %{spider_name: PipelineTestSpider}
{@item, state} = Crawly.Utils.pipe(pipelines, @item, state)

job_tag = Map.get(state, :job_tag)

{@item, state2} = Crawly.Utils.pipe(pipelines, @item, state)

assert Map.get(state2, :job_tag) == job_tag
end
end
