
Announcement: Now it's possible to create a spider in UI #141

Closed
oltarasenko opened this issue Nov 23, 2020 · 29 comments

@oltarasenko
Collaborator

Hey people,

I want to make a short announcement in the inner circle before moving forward in other places. I have just released an alpha version of the UI-based spider generator. You can find it here: http://crawlyui.com/spider/new

And here you can find an example of a spider created this way: http://crawlyui.com/spider/new?spider=ErlangSolutionsBlog

Would really appreciate feedback & ideas: @Ziinc

@oltarasenko
Collaborator Author

OK, I have managed to break everything...

@oltarasenko
Collaborator Author

Ok, it works again

@Ziinc
Collaborator

Ziinc commented Nov 24, 2020

@oltarasenko I'm a little confused as to how the spiders are dynamically created. From skimming the crawly_ui source, I notice that there isn't a base template spider module being used.

As mentioned previously, managing dynamic spiders is definitely very useful, because managing and monitoring dozens of spiders manually is quite a hassle.

Could you give a brief explanation of how the spider is being dynamically instantiated and passed to the Crawly engine? I understand that the extraction rules and init are stored in Ecto, but what about the spider-level settings? And how does the engine keep track of it, considering that spider discovery is based on module behaviour implementation?

@oltarasenko
Collaborator Author

Hey @Ziinc, the feature is quite experimental at the moment, and some parts will have to be improved (they look ugly). However, I wanted to build this feature first, and it feels like it's almost working. Once done, I will be able to rewrite it to make it look cleaner from the coding point of view (of course, only if there is interest in this feature at all; I've noticed that, in general, interest in scraping and crawling is quite limited in the Elixir community, so it may be that these efforts were not needed :().

Regarding the implementation:

  1. Currently we have one spider template, which is defined as a string for now (🤮). The idea is that a spider module is generated on the fly from that string plus data from the database (https://github.com/oltarasenko/crawly_ui/blob/master/lib/crawly_ui.ex#L30). I want to get a working version first, and then I will try to find a better way (a minimal sketch of the idea follows this list).
  2. Once the create_spider function is called, the module is generated from the string and dynamically loaded onto a worker node, so the worker can pick it up.
  3. Finally, once the code is loaded, the UI schedules it, similarly to how it does for other spiders.
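
For illustration only, here is a minimal sketch of the string-template idea, assuming an EEx template and Code.compile_string/1; the template, module name, and URLs below are placeholders, not the actual crawly_ui code:

# Hypothetical template; the real one lives in crawly_ui (see the link above).
template = """
defmodule <%= @name %> do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "<%= @base_url %>"

  @impl Crawly.Spider
  def init, do: [start_urls: <%= inspect(@start_urls) %>]

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
"""

# Fill the template from data stored in the database and compile the module.
template
|> EEx.eval_string(
  assigns: [name: "MySpider", base_url: "https://example.com", start_urls: ["https://example.com"]]
)
|> Code.compile_string()

# Loading the compiled module on a worker node could then be done over Erlang
# distribution, e.g. by shipping the source string and compiling it remotely
# via :rpc.call/4 (again, only a sketch).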

Other questions:

  1. For now, the spider-level settings are also in the same module as the extraction rules. However, in the future I plan to allow crawly_ui to schedule a spider with given settings (so settings will no longer be part of a given spider).
  2. The engine, for now, does not keep track of it, as the module is not part of the worker node's filesystem. So if the worker goes down, the UI will have to reload it, which I think makes sense. In general, I want the UI to be able to cross-schedule everything on all worker nodes added to it, so it can keep a registry of all spiders known by every node. Otherwise, we could also register a loaded spider in the Engine if its API is extended.

I think the most interesting part was done on the create-spider UI, as it allows you to preview how your item extractors work in real time.

@Ziinc
Collaborator

Ziinc commented Nov 28, 2020

I think there are two major aspects to this whole issue:

  1. spider management (start/stop), scheduling, and monitoring (ties in with Improvements for spider management #37 and Splitting spider logs #101)
  2. runtime spider creation (create/update/delete)

For me, the key problem with spider management right now is that when webmasters change their site design, the spider's extraction phase breaks and there isn't a simple way to fix it quickly at runtime.

For aspect 1, I think the monitoring capabilities are a little lacking in Crawly's current state. Scheduling is largely out of scope, and management is satisfactory so far. I think having a way to stream logs to a client would be a great core feature to resolve #101, or alternatively real-time aggregated log data. Perhaps we can leverage Telemetry to achieve that in both the core and the crawly_ui repo, so that people using Crawly can easily plug their custom UIs into the Crawly telemetry backend and browse historical logs, filter to view dropped items, etc.

This should allow for easy debugging of broken spiders during runtime and allow altering of the extraction rules on-the-fly.
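
A minimal sketch of what such a Telemetry hook could look like; the event name, PubSub topic, and module names are assumptions for illustration (Crawly does not emit this event today):

defmodule CrawlyUI.LogForwarder do
  # Forward a hypothetical Crawly telemetry event to a Phoenix PubSub topic,
  # so a LiveView (or any other subscriber) can stream it to the browser.
  def attach do
    :telemetry.attach(
      "crawly-ui-log-forwarder",
      # Assumed event name, purely illustrative.
      [:crawly, :item, :dropped],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    Phoenix.PubSub.broadcast(
      CrawlyUI.PubSub,
      "spider_logs:#{metadata.spider}",
      {:dropped_item, measurements, metadata}
    )
  end
end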

For aspect 2, as a first iteration, I think the attempt at the extraction rules is quite good. However, allowing users to inject spider code as a string at runtime isn't a very secure route to take. I was thinking that we could perhaps have a protocol focused on response extraction. Crawly could provide a few optional extraction pipelines to extract ParsedItems and new Requests, and these could be declared in the settings.

Because of the recent initialization options PR for #136, providing dynamic extraction configurations (through the pipelines concept) would not be an issue, since said initialization options can then be used in the override_settings callback.

For example, a runtime spider could be defined like so:

defmodule MySpiderTemplate do
  use Crawly.Spider

  @impl Crawly.Spider
  def init(opts) do
    # opts comes from stored state, such as a DB.
    # The spider name can be set here; spiders are then referred to by this
    # name instead of the module atom.
    [name: opts.name, start_urls: opts.start_urls]
  end

  @impl Crawly.Spider
  def override_settings(opts) do
    # opts is the merged list of init option args and the init result.
    # Since the init opts contain stored state, they may carry additional
    # extraction configuration.
    [
      parse: [
        # Crawly loads the request's response into a protocol struct and
        # pipes it through the declared modules.
        {ExtractRequests, glob: "/products/*"},
        # The XPath config for this spider is stored in the DB.
        {ExtractRequests, xpath: opts.request_xpath},
        {ExtractItems, rules: [...]}
      ]
    ]
  end

  # If the :parse setting key is given, there is no need to implement the
  # parse callback.
end

This should give enough flexibility for the developer to create spider templates for different situations, as well as configure the loading of the spider's configuration data through their own storage methods.

@Ziinc
Collaborator

Ziinc commented Nov 28, 2020

W.r.t. the demand for this feature, I would say that there is high demand for web crawling in general. Should you eventually productize this (I highly recommend that you do so), a way to easily create, manage, and deploy hundreds of spiders at scale is definitely something that many businesses need. The Elixir community may be small, but Crawly is currently the "go-to" solution for web scraping, judging by the Hex downloads compared to crawler.

@maxlorenz

I'd also be very interested in how to achieve that. I've tried a few things so far, none particularly pretty. What would you say about removing the restriction of one process per module, so you could instantiate the same crawler with, e.g., a different base URL and run multiple instances?

@oltarasenko
Collaborator Author

Hi @maxlorenz, could you explain what your use case is? Maybe I will be able to advise something.

Basically, the base URL is only used by the same-domain middleware. You could use a different version of it in this case. In any case, I would ask for more info!

@maxlorenz

Sure. I just want to have one spider module that fetches the base_url from a data source like a database, and I want to run multiple instances of that spider with different database IDs to get the base_url from.

@oltarasenko
Collaborator Author

@maxlorenz I will definitely be heading towards a more dynamic configuration of spiders.

Regarding your case, it looks like what you really need is a monitoring setup that just follows links that were set in the database. Do I understand your case correctly?

@maxlorenz

Exactly. I can already fetch the base URL inside the spider, but I can't start two instances or pass in arguments when the spider is starting.

@oltarasenko
Collaborator Author

Hey @maxlorenz, I think it's already possible to do this with the current Crawly. Let me think of an example to publish, as I am constantly thinking about additional content that might make Crawly simpler for the end user. I will try to produce it as soon as I can, which, looking at my current schedule, should be this week. I will write to you once I have something to share!

@maxlorenz

That would be very helpful!

@oltarasenko
Collaborator Author

@maxlorenz Just in case: are you extracting data from the same website all the time, or from multiple websites?

@oltarasenko
Collaborator Author

What confuses me here is:

  1. If all the links are from the same site, you potentially have no issues with a base URL; you just get it.
  2. If that's not the case, then it's not clear how you define item extractors capable of extracting data from different websites.

@maxlorenz

From multiple websites. I am just interested in a few <span>s, which means I can fetch their classes/IDs from the database.

@maxlorenz

I'll further process the data later in my pipeline

@oltarasenko
Collaborator Author

So, in general, you have a pre-defined set of rules stored in your database. E.g., if I am on siteA, I fetch the rule for siteA, and so on, right?

@maxlorenz

Exactly

@oltarasenko
Collaborator Author

Maybe you could list some example sites for the article?

@maxlorenz

I don't have a list at hand yet. Any site will do; just imagine I want to get the content of all the h1-h5 tags, but I will add more spiders during runtime.

@maxlorenz

I found a workaround using Module.create

@Ziinc
Collaborator

Ziinc commented Dec 10, 2020

Please refer to the three draft PRs for the main ideas from my previous comment.

Specifically, the runtime spider PR (#148) aims to avoid creating modules at runtime by allowing spider modules to be templates.

The parse pipelines PR (#150) aims to allow for configurable data extraction from responses.

@maxlorenz

@oltarasenko @Ziinc I managed to use quote + Module.create, which works. The only issue for now is that the created spiders won't show up in Crawly.Engine.list_known_spiders() since it can't find the module. Request/storage stats work, and I can use the engine supervisor's children to query for the running spiders. Let me know if I can help extend the list function.
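
For reference, a minimal sketch of the quote + Module.create approach; the factory module, the Floki-based extraction, and the selector argument are illustrative assumptions, not the actual implementation:

defmodule SpiderFactory do
  # Build and load a spider module at runtime from data such as a base URL
  # and a CSS selector fetched from a database.
  def build(module_name, base_url, selector) do
    contents =
      quote do
        use Crawly.Spider

        @impl Crawly.Spider
        def base_url, do: unquote(base_url)

        @impl Crawly.Spider
        def init, do: [start_urls: [unquote(base_url)]]

        @impl Crawly.Spider
        def parse_item(response) do
          items =
            response.body
            |> Floki.parse_document!()
            |> Floki.find(unquote(selector))
            |> Enum.map(&%{text: Floki.text(&1)})

          %Crawly.ParsedItem{items: items, requests: []}
        end
      end

    Module.create(module_name, contents, Macro.Env.location(__ENV__))
  end
end

A module built this way can then be started as usual, e.g. SpiderFactory.build(MyDynamicSpider, "https://example.com", "h1, h2, h3") followed by Crawly.Engine.start_spider(MyDynamicSpider).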

@oltarasenko
Collaborator Author

Hey @maxlorenz, yes, that will work. For example, I am doing the same on the CrawlyUI side for now. Regarding the listing function: I think the way to go would be to extend the Engine API so it's possible to register a spider there.

@oltarasenko
Collaborator Author

Hi again, @maxlorenz. As I promised, here is an article on how to do price monitoring across a list of websites: https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64?sk=9788899eb8e1d1dd6614d022eda350e8
(The Medium link contains a so-called friend code, so it should not require a subscription to read.) Please let me know what you think!

@maxlorenz

Very helpful! I got my project working:

I fetch the running spiders here (https://github.com/abotkit/teddy/blob/main/lib/teddy/spiders.ex#L40) and use the Module.create function like this: https://github.com/abotkit/teddy/blob/main/lib/teddy/spiders/spider.ex#L4

There's lots of room for improvement, and I'd like to explore the S3 export. I'd probably want to buffer the results before writing to S3 to minimize the traffic, as well as detect when a crawler is done and auto-stop the process. But it works for now 🎉
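
As an aside, a minimal sketch of the buffering idea, assuming ExAws.S3, Jason, and a plain GenServer; the bucket name, key scheme, and flush threshold are placeholders:

defmodule S3Buffer do
  use GenServer

  # Placeholder batch size before flushing to S3.
  @flush_threshold 100

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the item pipeline instead of writing to S3 directly.
  def write(item), do: GenServer.cast(__MODULE__, {:write, item})

  @impl true
  def init(_opts), do: {:ok, []}

  @impl true
  def handle_cast({:write, item}, buffer) do
    buffer = [item | buffer]

    if length(buffer) >= @flush_threshold do
      flush(buffer)
      {:noreply, []}
    else
      {:noreply, buffer}
    end
  end

  defp flush(items) do
    body = items |> Enum.reverse() |> Enum.map(&Jason.encode!/1) |> Enum.join("\n")
    key = "exports/#{System.system_time(:second)}.jsonl"

    "my-bucket"
    |> ExAws.S3.put_object(key, body)
    |> ExAws.request!()
  end
end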

Thanks for the input!

@maxlorenz

maxlorenz commented Dec 13, 2020

I used CrawlyUI as a reference

@oltarasenko
Collaborator Author

@Ziinc, @maxlorenz now it's possible to inspect logs in the UI :). Finally :)
http://crawlyui.com/logs/248/list?logs_filter=manager
