
Crawly UI experimental integration #97

Merged
merged 4 commits into from May 12, 2020
Conversation

oltarasenko
Collaborator

This PR contains changes required to make integration of Crawly & CrawlyUI possible.

The code is currently considered experimental. I am testing it quite intensively on related demo instances. However, I would like to take it into the codebase as part of the experimental namespace.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Having it as a separate node just for serving the UI seems like a lot of additional overhead. What are the benefits of doing so?

Could we achieve the same end goal (viewing manager stats in real time) through a Cowboy/phoenix process on the engine node? All stats (crawl speeds, request storage count, parsed items count) are already on the manager process as well.

I also don't think that the HTTP endpoint would be so resource-heavy as to require a separate node. After all, there is already the basic HTTP API endpoint.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Additionally, it would make more sense to break the UI up into modular pieces, to facilitate integration into existing codebases.

It is likely that the end user of the UI already has an admin dashboard. Instead of having separate dedicated endpoints, perhaps the UI could be provided as HTML fragments/iframes that can be embedded into an existing dashboard. For example, if I were to integrate the UI into my own codebase, I would only do a partial integration, since I already have an existing dashboard (with auth, etc.).

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Perhaps a boolean setting for enabling the UI would be enough, where it would conditionally start up a process from the crawly_ui package.
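
A minimal sketch of what that could look like - the `:ui_enabled` key and the `CrawlyUI.Supervisor` child below are assumptions for illustration, not existing Crawly settings:

    # config/config.exs - hypothetical opt-in flag (key name is an assumption)
    import Config

    config :crawly, ui_enabled: true

    # In the application's supervision tree, the UI process would only be
    # started when the flag is set; CrawlyUI.Supervisor is a placeholder.
    children =
      if Application.get_env(:crawly, :ui_enabled, false) do
        [CrawlyUI.Supervisor]
      else
        []
      end

    Supervisor.start_link(children, strategy: :one_for_one)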

@oltarasenko
Collaborator Author

> Having it as a separate node just for serving the UI seems like a lot of additional overhead. What are the benefits of doing so?
>
> Could we achieve the same end goal (viewing manager stats in real time) through a Cowboy/Phoenix process on the engine node? All stats (crawl speeds, request storage count, parsed items count) are already on the manager process as well.
>
> Additionally, it would make more sense to break the UI up into modular pieces, to facilitate integration into existing codebases.
>
> It is likely that the end user of the UI already has an admin dashboard. Instead of having separate dedicated endpoints, perhaps the UI could be provided as HTML fragments/iframes that can be embedded into an existing dashboard. For example, if I were to integrate the UI into my own codebase, I would only do a partial integration, since I already have an existing dashboard (with auth, etc.).

There is definitely an overhead; however, let me explain my motivation here:

  1. The approach above makes it possible to run multiple Crawly instances on separate nodes from one UI. This is done to spread the load across multiple instances; otherwise bandwidth becomes a bottleneck, at least for us :(.
  2. I am not a great fan of the idea of adding Phoenix to the Crawly repo. It looks like overkill from the dependency point of view. I want Crawly to be simple and extensible. I know other people are building a gRPC-based frontend for Crawly, and I simply don't want to dictate how to build it. As long as Crawly can just route items somewhere, anyone (anything) can be Crawly's frontend (see the sketch after this list).
  3. Scaling - I am not even sure that the current Postgres-based approach will work for our case. E.g. we have millions of items to insert per job :(; maybe, in that case, I will switch towards Mongo. Again, when I think about having Postgres (and potentially Mongo) based deps in Crawly's codebase, I feel like it's not the best place for them to be.
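
As a rough illustration of point 2, the routing step could be an ordinary item pipeline. The module path matches the lib/crawly/pipelines/experimental/send_to_ui.ex file added in this PR, but the body below is an assumed sketch, not the actual implementation:

    # Hypothetical sketch: forward each scraped item to a UI node over
    # distributed Erlang instead of storing it locally. The receiving
    # module/function (CrawlyUI.store_item/2) and the :spider_name key in
    # the pipeline state are assumptions for illustration.
    defmodule Crawly.Pipelines.Experimental.SendToUI do
      @behaviour Crawly.Pipeline

      @impl Crawly.Pipeline
      def run(item, state, opts) do
        ui_node = Keyword.get(opts, :ui_node)

        # Fire-and-forget cast, so crawling is not blocked by the UI node
        :rpc.cast(ui_node, CrawlyUI, :store_item, [item, state.spider_name])

        {item, state}
      end
    end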

To summarize:

  1. I want to align my current production requirements (being able to run massively distributed crawls) with Crawly/CrawlyUI, so I can have a bit more time to spend on the project.
  2. Some users have started to build their own private gRPC-based UI, and I support the idea that it should be possible to build more than one UI for the tool.
  3. I don't want to bring lots of dependencies into Crawly. E.g. I don't want Crawly to create/manage databases, etc.

Hopefully, my arguments were convincing enough!

@Ziinc
Collaborator

Ziinc commented May 11, 2020

I think the crawly_ui project is not actually concerned with just the UI: it implements both the data storage and the UI layer. In my opinion, it should be split into a crawly_repo (database & migrations only) and a crawly_ui (live crawling view, which can hook into the repo).
This makes it more useful for users who are doing flat-file crawling and just want to see live stats. It also makes the storage part optional for people who don't want to manage their own distributed storage solution.

I do admit that the multi-node option is good for distributed crawling, but it won't be needed by many users who are not at that scale. Setting up another node shouldn't be a requirement for using the UI solution.

I was thinking along the lines of crawly_ui hooking into the Crawly engine and only implementing the Phoenix parts: live streaming of manager stats and parsed items. This keeps Phoenix out of the Crawly repository.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

I have heard of CockroachDB as a Postgres alternative for geo-distribution. However, I don't see why Postgres can't handle that volume; all that's required is adding more slave nodes.

I'd advise against NoSQL databases, unless you don't need consistency/normalisation. And even then I'd pick Mnesia over Mongo.

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Also, I'm not saying that we should add the database and UI as dependencies in the Crawly repo. They can be installed separately and interoperate as needed.

@oltarasenko
Collaborator Author

@Ziinc I've got your point! Actually, now I like the idea of also adding a lightweight web layer just to manage Crawly (start/stop/speed/list), without the storage.

I am still a bit concerned about adding something as big as Phoenix here, as Phoenix means UI + database in most cases. However, we can think about it and brainstorm. My colleagues have shown me a very nice command-line based UX: https://github.com/ndreynolds/ratatouille#building-an-application.

Let's agree on the following: let's have a separate task for a lightweight CrawlyCommander for people who do not need the UI, but let's also merge this experimental middleware, so I can unblock my work on the project using Crawly.

@Ziinc Ziinc left a comment

minor code and docs improvements

Review threads (resolved) on:
lib/crawly/utils.ex
lib/crawly/pipelines/experimental/send_to_ui.ex
documentation/experimental_ui.md
README.md
@Ziinc
Collaborator

Ziinc commented May 11, 2020

> Let's agree on the following: let's have a separate task for a lightweight CrawlyCommander for people who do not need the UI, but let's also merge this experimental middleware, so I can unblock my work on the project using Crawly.

Yes, I think Ratatouille looks very nice, though I think it could be an opt-in TUI layer as well, for people doing flat-file crawling, just like how the web UI would be opt-in. Exposing Elixir-level APIs at the Engine/Commander(?) level would suffice for this repo. Also, I'm not too sure about the name Commander. Hmmm, maybe Conductor? Orchestrator? Chief?

While we're at it, we should also probably standardize the HTTP API endpoint to expose the same management capabilities.
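
For reference, a rough sketch of what such Elixir-level APIs could wrap - Crawly.Engine.start_spider/1 and Crawly.Engine.stop_spider/1 are existing functions, while the Crawly.Commander module itself is hypothetical:

    # Hypothetical wrapper module (name taken from the discussion above);
    # it simply delegates to existing Crawly.Engine calls.
    defmodule Crawly.Commander do
      def start(spider), do: Crawly.Engine.start_spider(spider)
      def stop(spider), do: Crawly.Engine.stop_spider(spider)
    end

    # Usage (spider module name is an example):
    # Crawly.Commander.start(MyApp.Crawler.Spiders.Example)
    # Crawly.Commander.stop(MyApp.Crawler.Spiders.Example)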

@Ziinc
Collaborator

Ziinc commented May 11, 2020

Oh yeah, to make development and testing easier, I think including any future ui/db repos as submodules inside an extras folder might be a good idea, to keep everything in sync.

@oltarasenko
Collaborator Author

@Ziinc thanks for looking into it. I think I have addressed all your comments now.

@oltarasenko requested a review from Ziinc May 12, 2020 07:25
Ziinc previously approved these changes May 12, 2020
@Ziinc Ziinc left a comment

Looks good

For reference's sake, my spider module finder code is here:

    # Collect all loaded modules under the spiders namespace, excluding test modules
    all_detected =
      :code.all_loaded()
      |> Enum.filter(fn {mod, _} -> "#{mod}" =~ ~r"MyApp.Crawler\.Spiders((?!Test).)*$" end)
      |> Enum.map(fn {mod, _} -> mod end)
      |> Enum.uniq()

It can be improved by checking whether each module that matches the regex actually implements the Crawly spider behaviour.
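
A possible sketch of that check, assuming the spider behaviour module is Crawly.Spider:

    # Keep only the modules that declare the Crawly.Spider behaviour
    spiders =
      Enum.filter(all_detected, fn mod ->
        mod.module_info(:attributes)
        |> Keyword.get_values(:behaviour)
        |> List.flatten()
        |> Enum.member?(Crawly.Spider)
      end)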
