
Crawly UI experimental integration #97

Merged
merged 4 commits into from May 12, 2020
Conversation

oltarasenko
Collaborator

This PR contains changes required to make integration of Crawly & CrawlyUI possible.

The code is currently considered experimental. I am testing it quite intensively on related demo instances. However, I would like to take it into the codebase as part of the experimental namespace.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Having it as a separate node just for serving the UI seems like a lot of additional overhead. What are the benefits of doing so?

Could we achieve the same end goal (viewing manager stats in real time) through a Cowboy/phoenix process on the engine node? All stats (crawl speeds, request storage count, parsed items count) are already on the manager process as well.

I also don't think that the HTTP endpoint would be so resource-heavy as to require a separate node. After all, there is already the basic HTTP API endpoint.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Additionally, it would make more sense to break the UI up into modular pieces, to facilitate integration into existing codebases.

It is likely that the end user of the UI already has an admin dashboard. Instead of having separate dedicated endpoints, perhaps the UI could be provided as HTML fragments/iframes that can be embedded into an existing dashboard. For example, if I were to integrate the UI into my own codebase, I would only do a partial integration, since I already have an existing dashboard (with auth, etc.).

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Perhaps a boolean setting for enabling the UI would be enough, where it would conditionally start up a process from the crawly_ui package.
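
A minimal sketch of what that could look like - the `:ui_enabled` key and the `CrawlyUI.Supervisor` child below are assumptions for illustration, not existing Crawly settings:

    # config/config.exs - hypothetical opt-in flag (key name is an assumption)
    import Config

    config :crawly, ui_enabled: true

    # In the application's supervision tree, the UI process would only be
    # started when the flag is set; CrawlyUI.Supervisor is a placeholder.
    children =
      if Application.get_env(:crawly, :ui_enabled, false) do
        [CrawlyUI.Supervisor]
      else
        []
      end

    Supervisor.start_link(children, strategy: :one_for_one)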

@oltarasenko
Collaborator Author

> Having it as a separate node just for serving the UI seems like a lot of additional overhead. What are the benefits of doing so?
>
> Could we achieve the same end goal (viewing manager stats in real time) through a Cowboy/Phoenix process on the engine node? All stats (crawl speeds, request storage count, parsed items count) are already on the manager process as well.
>
> Additionally, it would make more sense to break the UI up into modular pieces, to facilitate integration into existing codebases.
>
> It is likely that the end user of the UI already has an admin dashboard. Instead of having separate dedicated endpoints, perhaps the UI could be provided as HTML fragments/iframes that can be embedded into an existing dashboard. For example, if I were to integrate the UI into my own codebase, I would only do a partial integration, since I already have an existing dashboard (with auth, etc.).

There is definitely an overhead; however, let me explain my motivation here:

  1. The approach above makes it possible to run multiple Crawly instances on separate nodes from one UI. This is done to spread the load across multiple instances; otherwise bandwidth becomes a bottleneck, at least for us :(.
  2. I am not a great fan of the idea of adding Phoenix to the Crawly repo. It looks like overkill from the dependency point of view. I want Crawly to be simple and extensible. I know other people are building a gRPC-based frontend for Crawly, and I simply don't want to dictate how to build it. As long as Crawly can just route items somewhere, anyone (anything) can be Crawly's frontend (see the sketch after this list).
  3. Scaling - I am not even sure that the current Postgres-based approach will work for our case. E.g. we have millions of items to insert per job :(; maybe, in that case, I will switch towards Mongo. Again, when I think about having Postgres (and potentially Mongo) based deps in Crawly's codebase, I feel like it's not the best place for them to be.
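
As a rough illustration of point 2, the routing step could be an ordinary item pipeline. The module path matches the lib/crawly/pipelines/experimental/send_to_ui.ex file added in this PR, but the body below is an assumed sketch, not the actual implementation:

    # Hypothetical sketch: forward each scraped item to a UI node over
    # distributed Erlang instead of storing it locally. The receiving
    # module/function (CrawlyUI.store_item/2) and the :spider_name key in
    # the pipeline state are assumptions for illustration.
    defmodule Crawly.Pipelines.Experimental.SendToUI do
      @behaviour Crawly.Pipeline

      @impl Crawly.Pipeline
      def run(item, state, opts) do
        ui_node = Keyword.get(opts, :ui_node)

        # Fire-and-forget cast, so crawling is not blocked by the UI node
        :rpc.cast(ui_node, CrawlyUI, :store_item, [item, state.spider_name])

        {item, state}
      end
    end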

To summarize:

  1. I want to align my current production requirements (being able to run massively distributed crawls) with Crawly/CrawlyUI, so I can have a bit more time to spend on the project.
  2. Some users have started to build their own private gRPC-based UI, and I support the idea that it should be possible to build more than one UI for the tool.
  3. I don't want to bring lots of dependencies into Crawly. E.g. I don't want Crawly to create/manage databases, etc.

Hopefully, my arguments were convincing enough!

@Ziinc
Collaborator

Ziinc commented May 11, 2020

I think the crawly_ui project is not actually concerned with just the UI: it implements both the data storage and the UI layer. In my opinion, it should be split into a crawly_repo (database & migrations only) and a crawly_ui (live crawling view, which can hook into the repo).
This makes it more useful for users who are doing flat-file crawling and just want to see live stats. It also makes the storage part optional for people who don't want to manage their own distributed storage solution.

I do admit that the multi-node option is good for distributed crawling, but it won't be needed by many users who are not at that scale. Setting up another node shouldn't be a requirement for using the UI solution.

I was thinking along the lines of crawly_ui hooking into the Crawly engine and only implementing the Phoenix parts: live streaming of manager stats and parsed items. This keeps Phoenix out of the Crawly repository.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

I have heard of CockroachDB as a Postgres alternative for geo-distribution. However, I don't see why Postgres can't handle that volume; all that's required is adding more slave nodes.

I'd advise against NoSQL databases, unless you don't need consistency/normalisation. And even then I'd pick Mnesia over Mongo.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Also, I'm not saying that we should add the database and UI as dependencies in the Crawly repo. They can be installed separately and interoperate as needed.

@oltarasenko
Collaborator Author

@Ziinc I've got your point! Actually, now I like the idea of also adding a lightweight web layer just to manage Crawly (start/stop/speed/list), without the storage.

I am still a bit concerned about adding something as big as Phoenix here, as Phoenix means UI + database in most cases. However, we can think about it and brainstorm. My colleagues have shown me a very nice command-line based UX: https://github.com/ndreynolds/ratatouille#building-an-application.

Let's agree on the following: let's have a separate task for a lightweight CrawlyCommander for people who do not need the UI, but let's also merge this experimental middleware, so I can unblock my work on the project using Crawly.

@Ziinc Ziinc left a comment

minor code and docs improvements

Review threads (resolved) on:
lib/crawly/utils.ex
lib/crawly/pipelines/experimental/send_to_ui.ex
documentation/experimental_ui.md
README.md
@Ziinc
Collaborator

Ziinc commented May 11, 2020

> Let's agree on the following: let's have a separate task for a lightweight CrawlyCommander for people who do not need the UI, but let's also merge this experimental middleware, so I can unblock my work on the project using Crawly.

Yes, I think Ratatouille looks very nice, though I think it could be an opt-in TUI layer as well, for people doing flat-file crawling, just like how the web UI would be opt-in. Exposing Elixir-level APIs at the Engine/Commander(?) level would suffice for this repo. Also, I'm not too sure about the name Commander. Hmmm, maybe Conductor? Orchestrator? Chief?

While we're at it, we should also probably standardize the HTTP API endpoint to expose the same management capabilities.
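
For reference, a rough sketch of what such Elixir-level APIs could wrap - Crawly.Engine.start_spider/1 and Crawly.Engine.stop_spider/1 are existing functions, while the Crawly.Commander module itself is hypothetical:

    # Hypothetical wrapper module (name taken from the discussion above);
    # it simply delegates to existing Crawly.Engine calls.
    defmodule Crawly.Commander do
      def start(spider), do: Crawly.Engine.start_spider(spider)
      def stop(spider), do: Crawly.Engine.stop_spider(spider)
    end

    # Usage (spider module name is an example):
    # Crawly.Commander.start(MyApp.Crawler.Spiders.Example)
    # Crawly.Commander.stop(MyApp.Crawler.Spiders.Example)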

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Oh yeah, to make development and testing easier, I think including any future ui/db repos as submodules inside an extras folder might be a good idea, to keep everything in sync.

@oltarasenko
Collaborator Author

@Ziinc thanks for looking into it. I think I have addressed all your comments now.

@oltarasenko requested a review from Ziinc May 12, 2020 07:25
Ziinc previously approved these changes May 12, 2020
@Ziinc Ziinc left a comment

Looks good

For reference's sake, my spider module finder code is here:

    # Collect all loaded modules under the spiders namespace, excluding test modules
    all_detected =
      :code.all_loaded()
      |> Enum.filter(fn {mod, _} -> "#{mod}" =~ ~r"MyApp.Crawler\.Spiders((?!Test).)*$" end)
      |> Enum.map(fn {mod, _} -> mod end)
      |> Enum.uniq()

It can be improved by checking whether each module that matches the regex actually implements the Crawly spider behaviour.
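
A possible sketch of that check, assuming the spider behaviour module is Crawly.Spider:

    # Keep only the modules that declare the Crawly.Spider behaviour
    spiders =
      Enum.filter(all_detected, fn mod ->
        mod.module_info(:attributes)
        |> Keyword.get_values(:behaviour)
        |> List.flatten()
        |> Enum.member?(Crawly.Spider)
      end)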
