
Announcement: Now it's possible to create a spider in UI #141

Closed
oltarasenko opened this issue Nov 23, 2020 · 29 comments

@oltarasenko
Collaborator

Hey people,

I want to make a short announcement in the inner circle before moving forward in other places. I have just released an alpha version of the UI-based spider generator. You can find it here: http://crawlyui.com/spider/new

And here you can find an example of a spider created this way: http://crawlyui.com/spider/new?spider=ErlangSolutionsBlog

Would really appreciate feedback & ideas: @Ziinc

@oltarasenko
Collaborator Author

OK, I have managed to break everything...

@oltarasenko
Collaborator Author

Ok, it works again

@Ziinc
Collaborator

Ziinc commented Nov 24, 2020

@oltarasenko I'm a little confused as to how the spiders are dynamically created. From skimming the crawly_ui source, I notice that there isn't a base template spider module being used.

As mentioned previously, managing dynamic spiders is definitely very useful, because managing and monitoring dozens of spiders manually is quite a hassle.

Could you give a brief explanation of how the spider is being dynamically instantiated and passed to the Crawly engine? I understand that the extraction rules and init are stored in Ecto, but what about the spider-level settings? And how does the engine keep track of it, considering that spider discovery is based on module behaviour implementation?

@oltarasenko
Collaborator Author

Hey @Ziinc, the feature is quite experimental at the moment, and some parts will have to be improved (they look ugly). However, I wanted to build this feature first, and it feels like it's almost working. Once done, I will be able to rewrite it to make it look cleaner from the coding point of view (of course, only if there is interest in this feature at all; I've noticed that, in general, interest in scraping and crawling is quite limited in the Elixir community, so it may be that these efforts were not needed :().

Regarding the implementation:

  1. Currently we have one spider template, which is defined as a string for now (🤮). The idea is that a spider module is generated on the fly from that string plus data from the database (https://github.com/oltarasenko/crawly_ui/blob/master/lib/crawly_ui.ex#L30). I want to get a working version first, and then I will try to find a better way (a minimal sketch of the idea follows this list).
  2. Once the create_spider function is called, the module is generated from the string and dynamically loaded onto a worker node, so the worker can pick it up.
  3. Finally, once the code is loaded, the UI schedules it, similarly to how it does for other spiders.
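
For illustration only, here is a minimal sketch of the string-template idea, assuming an EEx template and Code.compile_string/1; the template, module name, and URLs below are placeholders, not the actual crawly_ui code:

# Hypothetical template; the real one lives in crawly_ui (see the link above).
template = """
defmodule <%= @name %> do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "<%= @base_url %>"

  @impl Crawly.Spider
  def init, do: [start_urls: <%= inspect(@start_urls) %>]

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
"""

# Fill the template from data stored in the database and compile the module.
template
|> EEx.eval_string(
  assigns: [name: "MySpider", base_url: "https://example.com", start_urls: ["https://example.com"]]
)
|> Code.compile_string()

# Loading the compiled module on a worker node could then be done over Erlang
# distribution, e.g. by shipping the source string and compiling it remotely
# via :rpc.call/4 (again, only a sketch).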

Other questions:

  1. For now, the spider-level settings are also in the same module as the extraction rules. However, in the future I plan to allow crawly_ui to schedule a spider with given settings (so settings will no longer be part of a given spider).
  2. The engine, for now, does not keep track of it, as the module is not part of the worker node's filesystem. So if the worker goes down, the UI will have to reload it, which I think makes sense. In general, I want the UI to be able to cross-schedule everything on all worker nodes added to it, so it can keep a registry of all spiders known by every node. Otherwise, we could also register a loaded spider in the Engine if its API is extended.

I think the most interesting part was done on the create-spider UI, as it allows you to preview how your item extractors work in real time.

@Ziinc
Collaborator

Ziinc commented Nov 28, 2020

I think there are two major aspects to this whole issue:

  1. spider management (start/stop), scheduling, and monitoring (ties in with Improvements for spider management #37 and Splitting spider logs #101)
  2. runtime spider creation (create/update/delete)

For me, the key problem with spider management right now is that when webmasters change their site design, the spider's extraction phase breaks and there isn't a simple way to fix it quickly at runtime.

For aspect 1, I think the monitoring capabilities are a little lacking in Crawly's current state. Scheduling is largely out of scope, and management is satisfactory so far. I think having a way to stream logs to a client would be a great core feature to resolve #101, or alternatively real-time aggregated log data. Perhaps we can leverage Telemetry to achieve that in both the core and the crawly_ui repo, so that people using Crawly can easily plug their custom UIs into the Crawly telemetry backend and browse historical logs, filter to view dropped items, etc.

This should allow for easy debugging of broken spiders during runtime and allow altering of the extraction rules on-the-fly.
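
A minimal sketch of what such a Telemetry hook could look like; the event name, PubSub topic, and module names are assumptions for illustration (Crawly does not emit this event today):

defmodule CrawlyUI.LogForwarder do
  # Forward a hypothetical Crawly telemetry event to a Phoenix PubSub topic,
  # so a LiveView (or any other subscriber) can stream it to the browser.
  def attach do
    :telemetry.attach(
      "crawly-ui-log-forwarder",
      # Assumed event name, purely illustrative.
      [:crawly, :item, :dropped],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    Phoenix.PubSub.broadcast(
      CrawlyUI.PubSub,
      "spider_logs:#{metadata.spider}",
      {:dropped_item, measurements, metadata}
    )
  end
end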

For aspect 2, as a first iteration, I think the attempt at the extraction rules is quite good. However, allowing users to inject spider code as a string at runtime isn't a very secure route to take. I was thinking that we could perhaps have a protocol focused on response extraction. Crawly could provide a few optional extraction pipelines to extract ParsedItems and new Requests, and these could be declared in the settings.

Because of the recent initialization options PR for #136, providing dynamic extraction configurations (through the pipelines concept) would not be an issue, since said initialization options can then be used in the override_settings callback.

For example, a runtime spider could be defined like so:

defmodule MySpiderTemplate do
  use Crawly.Spider

  @impl Crawly.Spider
  def init(opts) do
    # opts comes from stored state, such as a DB.
    # The spider name can be set here; spiders are then referred to by this
    # name instead of the module atom.
    [name: opts.name, start_urls: opts.start_urls]
  end

  @impl Crawly.Spider
  def override_settings(opts) do
    # opts is the merged list of init option args and the init result.
    # Since the init opts contain stored state, they may carry additional
    # extraction configuration.
    [
      parse: [
        # Crawly loads the request's response into a protocol struct and
        # pipes it through the declared modules.
        {ExtractRequests, glob: "/products/*"},
        # The XPath config for this spider is stored in the DB.
        {ExtractRequests, xpath: opts.request_xpath},
        {ExtractItems, rules: [...]}
      ]
    ]
  end

  # If the :parse setting key is given, there is no need to implement the
  # parse callback.
end

This should give enough flexibility for the developer to create spider templates for different situations, as well as configure the loading of the spider's configuration data through their own storage methods.

@Ziinc
Collaborator

Ziinc commented Nov 28, 2020

W.r.t. the demand for this feature, I would say that there is high demand for web crawling in general. Should you eventually productize this (I highly recommend that you do so), a way to easily create, manage, and deploy hundreds of spiders at scale is definitely something that many businesses need. The Elixir community may be small, but Crawly is currently the "go-to" solution for web scraping, judging by the Hex downloads compared to crawler.

@maxlorenz

I'd also be very interested in how to achieve that. I've tried a few things so far, none particularly pretty. What would you say about removing the restriction of one process per module, so you could instantiate the same crawler with, e.g., a different base URL and run multiple instances?

@oltarasenko
Collaborator Author

Hi @maxlorenz, could you explain what your use case is? Maybe I will be able to advise something.

Basically, the base URL is only used by the same-domain middleware. You could use a different version of it in this case. In any case, I would ask for more info!

@maxlorenz

Sure. I just want to have one spider module that fetches the base_url from a data source like a database, and I want to run multiple instances of that spider with different database IDs to get the base_url from.

@oltarasenko
Collaborator Author

@maxlorenz I will definitely be heading towards a more dynamic configuration of spiders.

Regarding your case, it looks like what you really need is a monitoring setup that just follows links that were set in the database. Do I understand your case correctly?

@maxlorenz

Exactly. I can already fetch the base URL inside the spider, but I can't start two instances or pass in arguments when the spider is starting.

@oltarasenko
Collaborator Author

Hey @maxlorenz, I think it's already possible to do this with the current Crawly. Let me think of an example to publish, as I am constantly thinking about additional content that might make Crawly simpler for the end user. I will try to produce it as soon as I can, which, looking at my current schedule, should be this week. I will write to you once I have something to share!

@maxlorenz

That would be very helpful!

@oltarasenko
Collaborator Author

@maxlorenz Just in case: are you extracting data from the same website all the time, or from multiple websites?

@oltarasenko
Collaborator Author

What confuses me here is:

  1. If all the links are from the same site, you potentially have no issues with a base URL; you just get it.
  2. If that's not the case, then it's not clear how you define item extractors capable of extracting data from different websites.

@maxlorenz

From multiple websites. I am just interested in a few <span>s, which means I can fetch their classes/IDs from the database.

@maxlorenz

I'll further process the data later in my pipeline

@oltarasenko
Collaborator Author

So, in general, you have a pre-defined set of rules stored in your database. E.g., if I am on siteA, I fetch the rule for siteA, and so on, right?

@maxlorenz

Exactly

@oltarasenko
Collaborator Author

Maybe you could list some example sites for the article?

@maxlorenz

I don't have a list at hand yet. Any site will do; just imagine I want to get the content of all the h1-h5 tags, but I will add more spiders during runtime.

@maxlorenz

I found a workaround using Module.create

@Ziinc
Collaborator

Ziinc commented Dec 10, 2020

Please refer to the three draft PRs for the main ideas from my previous comment.

Specifically, the runtime spider PR (#148) aims to avoid creating modules at runtime by allowing spider modules to be templates.

The parse pipelines PR (#150) aims to allow for configurable data extraction from responses.

@maxlorenz

@oltarasenko @Ziinc I managed to use quote + Module.create, which works. The only issue for now is that the created spiders won't show up in Crawly.Engine.list_known_spiders() since it can't find the module. Request/storage stats work, and I can use the engine supervisor's children to query for the running spiders. Let me know if I can help extend the list function.
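
For reference, a minimal sketch of the quote + Module.create approach; the factory module, the Floki-based extraction, and the selector argument are illustrative assumptions, not the actual implementation:

defmodule SpiderFactory do
  # Build and load a spider module at runtime from data such as a base URL
  # and a CSS selector fetched from a database.
  def build(module_name, base_url, selector) do
    contents =
      quote do
        use Crawly.Spider

        @impl Crawly.Spider
        def base_url, do: unquote(base_url)

        @impl Crawly.Spider
        def init, do: [start_urls: [unquote(base_url)]]

        @impl Crawly.Spider
        def parse_item(response) do
          items =
            response.body
            |> Floki.parse_document!()
            |> Floki.find(unquote(selector))
            |> Enum.map(&%{text: Floki.text(&1)})

          %Crawly.ParsedItem{items: items, requests: []}
        end
      end

    Module.create(module_name, contents, Macro.Env.location(__ENV__))
  end
end

A module built this way can then be started as usual, e.g. SpiderFactory.build(MyDynamicSpider, "https://example.com", "h1, h2, h3") followed by Crawly.Engine.start_spider(MyDynamicSpider).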

@oltarasenko
Collaborator Author

Hey @maxlorenz, yes, that will work. For example, I am doing the same on the CrawlyUI side for now. Regarding the listing function: I think the way to go would be to extend the Engine API so it's possible to register a spider there.

@oltarasenko
Collaborator Author

Hi again, @maxlorenz. As I promised, here is an article on how to do price monitoring across a list of websites: https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64?sk=9788899eb8e1d1dd6614d022eda350e8
(The Medium link contains a so-called friend code, so it should not require a subscription to read.) Please let me know what you think!

@maxlorenz

Very helpful! I got my project working:

I fetch the running spiders here (https://github.com/abotkit/teddy/blob/main/lib/teddy/spiders.ex#L40) and use the Module.create function like this: https://github.com/abotkit/teddy/blob/main/lib/teddy/spiders/spider.ex#L4

There's lots of room for improvement, and I'd like to explore the S3 export. I'd probably want to buffer the results before writing to S3 to minimize the traffic, as well as detect when a crawler is done and auto-stop the process. But it works for now 🎉
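
As an aside, a minimal sketch of the buffering idea, assuming ExAws.S3, Jason, and a plain GenServer; the bucket name, key scheme, and flush threshold are placeholders:

defmodule S3Buffer do
  use GenServer

  # Placeholder batch size before flushing to S3.
  @flush_threshold 100

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the item pipeline instead of writing to S3 directly.
  def write(item), do: GenServer.cast(__MODULE__, {:write, item})

  @impl true
  def init(_opts), do: {:ok, []}

  @impl true
  def handle_cast({:write, item}, buffer) do
    buffer = [item | buffer]

    if length(buffer) >= @flush_threshold do
      flush(buffer)
      {:noreply, []}
    else
      {:noreply, buffer}
    end
  end

  defp flush(items) do
    body = items |> Enum.reverse() |> Enum.map(&Jason.encode!/1) |> Enum.join("\n")
    key = "exports/#{System.system_time(:second)}.jsonl"

    "my-bucket"
    |> ExAws.S3.put_object(key, body)
    |> ExAws.request!()
  end
end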

Thanks for the input!

@maxlorenz

maxlorenz commented Dec 13, 2020

I used CrawlyUI as a reference

@oltarasenko
Collaborator Author

@Ziinc, @maxlorenz now it's possible to inspect logs in the UI :). Finally :)
http://crawlyui.com/logs/248/list?logs_filter=manager
