Crawly UI experimental integration #97
Conversation
Having a separate node just for serving the UI seems like a lot of additional overhead. What are the benefits of doing so? Could we achieve the same end goal (viewing manager stats in real time) through a Cowboy/Phoenix process on the engine node? All stats (crawl speeds, request storage count, parsed items count) are already on the manager process. I also don't think that the HTTP endpoint would be so resource-heavy as to require a separate node; after all, there is already the basic HTTP API endpoint.
Additionally, it would make more sense to break the UI up into modular pieces, to facilitate integration into existing codebases. The end user of the UI is likely to already have an admin dashboard. Instead of having separate dedicated endpoints, perhaps the UI could be shipped as HTML fragments/iframes that can be embedded into an existing dashboard. For example, if I were to integrate the UI into my own codebase, I would only do a partial integration, as I already have an existing dashboard (with auth etc.).
Perhaps a conditional Boolean setting for enabling the UI would be enough? It would conditionally start up a process from the crawly_ui package.
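(As a rough sketch of that idea: the host application's supervisor could gate a UI child on a config flag. The :start_ui key, CrawlyUI.Supervisor, and MyApp.Worker below are hypothetical names for illustration, not existing Crawly options.)

# lib/my_app/application.ex — hypothetical opt-in UI child
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # CrawlyUI.Supervisor is only started when the (assumed) flag is set.
    ui_children =
      if Application.get_env(:crawly, :start_ui, false),
        do: [CrawlyUI.Supervisor],
        else: []

    children = [MyApp.Worker | ui_children]
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end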
There is definitely some overhead; however, let me explain my motivation here:
To summarize:
Hopefully, my arguments were convincing enough!
I think the crawly_ui project is not actually concerned with just the UI; it implements both the data storage and the UI layer. In my opinion, it should be split up into a crawly_repo (database & migrations only) and a crawly_ui (live crawling views, which can hook into the repo). I do admit that the multi-node option is good for distributed crawling, but it won't be needed by the many users who are not at that scale, and setting up another node shouldn't be a requirement for using the UI solution. I was thinking along the lines of crawly_ui hooking into the Crawly engine and only implementing the Phoenix parts: live streaming of manager stats and parsed items. This keeps Phoenix out of the crawly repository.
I have heard of CockroachDB as a Postgres alternative for geo-distribution. However, I don't see why Postgres can't handle that volume; all that'd be required is adding more read replicas. I'd advise against NoSQL databases unless you don't need consistency/normalisation, and even then I'd pick Mnesia over Mongo.
Also, I'm not saying that we should add the database and UI as dependencies in the crawly repo. They can be installed separately and interoperate as needed.
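(For illustration, a host application could pull the pieces in independently via mix.exs; the crawly_ui entry below is an assumption, since the package's source and version are not settled here.)

# mix.exs of a host application — the crawly_ui source/version are hypothetical
defp deps do
  [
    {:crawly, "~> 0.10"},
    {:crawly_ui, github: "oltarasenko/crawly_ui"}
  ]
end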
@Ziinc I've got your point! Actually, now I like the idea of also adding a lightweight web layer just to manage Crawly (start/stop/speed/list), without the storage. I am still a bit concerned about adding something as big as Phoenix here, as Phoenix means UI + database in most cases. However, we can think about it & brainstorm. My colleagues have shown me a very nice command-line based UX: https://github.com/ndreynolds/ratatouille#building-an-application. Let's agree on the following: let's have a separate task for a lightweight CrawlyCommander for people who do not need a UI, but let's also merge this experimental middleware, so I can unblock my work on the project using Crawly.
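(For context, Ratatouille apps follow an Elm-style init/update/render loop. A minimal counter sketch, closely following the library's README:)

defmodule Counter do
  @behaviour Ratatouille.App

  import Ratatouille.View

  # Initial model: a bare counter.
  def init(_context), do: 0

  # '+' and '-' keypresses arrive as events and update the model.
  def update(model, msg) do
    case msg do
      {:event, %{ch: ?+}} -> model + 1
      {:event, %{ch: ?-}} -> model - 1
      _ -> model
    end
  end

  # Re-rendered after every update.
  def render(model) do
    view do
      label(content: "Counter is #{model} (+/-)")
    end
  end
end

Ratatouille.run(Counter)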
Minor code and docs improvements.
Yes, I think Ratatouille looks very nice, though it could also be an opt-in TUI layer for people doing flat-file crawling, just like how the web UI would be opt-in. Exposing Elixir-level APIs at the Engine/Commander(?) level would suffice for this repo. Also, I'm not too sure about the name Commander. Hmm, maybe Conductor? Orchestrator? Chief? While we're at it, we should probably also standardize the HTTP API endpoint to expose the same management capabilities.
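(To make "the same management capabilities" concrete, a standardized endpoint might map routes onto the Elixir-level engine calls. This is a sketch only: the route shapes and module name are assumptions, while Crawly.Engine.start_spider/1 and Crawly.Engine.stop_spider/1 are the existing engine functions.)

# Hypothetical Plug router exposing start/stop over HTTP
defmodule CrawlyManagementAPI do
  use Plug.Router

  plug :match
  plug :dispatch

  # e.g. POST /spiders/MyApp.Crawler.Spiders.Blog/start
  post "/spiders/:spider/start" do
    spider |> to_module() |> Crawly.Engine.start_spider()
    send_resp(conn, 200, "started")
  end

  post "/spiders/:spider/stop" do
    spider |> to_module() |> Crawly.Engine.stop_spider()
    send_resp(conn, 200, "stopped")
  end

  # Resolve "MyApp.Foo" to an existing module atom without creating new atoms.
  defp to_module(name), do: String.to_existing_atom("Elixir." <> name)
end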
Oh yeah, to make development and testing easier, I think including any future ui/db repos as submodules inside a …
@Ziinc thanks for looking into it. I think I have addressed all your comments now. |
Looks good
For reference's sake, my spider module finder code is here:
# Find loaded modules whose names match the spider namespace, skipping Test modules.
# Caveat: :code.all_loaded/0 only sees modules the VM has already loaded, so in
# interactive (dev) mode spiders may be missed until first referenced.
all_detected =
  :code.all_loaded()
  |> Enum.filter(fn {mod, _file} -> "#{mod}" =~ ~r"MyApp\.Crawler\.Spiders((?!Test).)*$" end)
  |> Enum.map(fn {mod, _file} -> mod end)
  |> Enum.uniq()
It could be improved by additionally checking whether each module that matches the regex actually implements the Crawly spider behaviour.
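(A minimal sketch of that extra check, assuming Crawly.Spider is the behaviour module; declared behaviours are readable from a module's attributes at runtime:)

# true if mod declares the Crawly.Spider behaviour
spider? = fn mod ->
  mod.module_info(:attributes)
  |> Keyword.get_values(:behaviour)
  |> List.flatten()
  |> Enum.member?(Crawly.Spider)
end

all_spiders = Enum.filter(all_detected, spider?)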
This PR contains the changes required to make the integration of Crawly & CrawlyUI possible.
The code is currently considered experimental; I am testing it quite intensively on related demo instances. For now, I would take it into the codebase as part of an experimental namespace.