Announcement: it's now possible to create a spider in the UI #141
OK, I have managed to break everything... |
Ok, it works again |
@oltarasenko I'm a little confused as to how the spiders are dynamically created; skimming the source didn't make it obvious to me. As mentioned previously, managing dynamic spiders is definitely very useful, because managing and monitoring dozens of spiders manually is quite a hassle. Could you give a brief explanation of how the spider is dynamically instantiated and passed to the Crawly engine? I understand that the extraction rules and init are stored in Ecto, but what about the spider-level settings? And how does the engine keep track of it, considering that spider discovery is based on module behaviour implementation? |
Hey @Ziinc, the feature is quite experimental at the moment, and some parts will have to be improved (they look ugly). However, I wanted to build this feature, and it feels like it's almost working. Once done, I will be able to rewrite it to make it look normal from a coding point of view. (Of course, only if there turns out to be interest in this feature. I've noticed that interest in scraping and crawling is quite limited in the Elixir community in general, so it may be that these efforts were not needed :(.) Regarding the implementation:
Other questions:
I think the most interesting part was done on the spider-creation UI, as it allows you to preview how your item extractors work in real time. |
I think there are two major aspects to this whole issue:

1. runtime monitoring and management of spiders
2. dynamic creation of spiders and their extraction rules
For me, the key problem I have right now with spider management is that when webmasters change their design, the spider's extraction phase breaks and there isn't a simple way to quickly fix it at runtime.

For aspect 1, I think the monitoring capabilities are a little lacking in Crawly's current state. Scheduling is largely out of scope, and management is satisfactory so far. I think having a way to stream logs to a client would be a great core feature to have to resolve #101, or alternatively real-time aggregated log data. Perhaps we can leverage existing tooling for this. It should allow for easy debugging of broken spiders during runtime and allow altering of the extraction rules on the fly.

For aspect 2, as a first iteration, I think the attempt at the extraction rules is quite good. Allowing users to inject spider code as a string at runtime isn't a very secure route to go, though. I was thinking that we could perhaps have a protocol focused on response extraction: Crawly could provide a few optional extraction pipelines to extract ParsedItems and new Requests, declared in the settings. Because of the recent initialization-options PR for #136, providing dynamic extraction configurations (through the pipelines concept) would not be an issue, since said initialization options can then be used in the `override_settings` callback. For example, a runtime spider could be defined like so:

```elixir
defmodule MySpiderTemplate do
  @impl Crawly.Spider
  def init(opts) do
    # opts comes from stored state, such as a db.
    # The spider name can be set here; spiders are now referred to by
    # this name instead of the module atom.
    [name: opts.name, start_urls: opts.start_urls]
  end

  @impl Crawly.Spider
  def override_settings(opts) do
    # opts is the merged list of init option args and the init result.
    # Since the init opts contain stored state, they may carry additional
    # extraction configuration.
    [
      parse: [
        # Crawly loads the request's response into a protocol struct and
        # pipes it through the declared modules.
        {ExtractRequests, glob: "/products/*"},
        # The xpath config for this spider is stored in the db.
        {ExtractRequests, xpath: opts.request_xpath},
        {ExtractItems, rules: [...]}
      ]
    ]
  end

  # If the :parse setting key is given, there is no need to implement
  # the parse callback.
end
```

This should give the developer enough flexibility to create spider templates for different situations, as well as to configure how the spider's configuration data is loaded from their own storage. |
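To make the idea concrete, here is a hypothetical way of starting such a template; the per-site `SpiderConfig` storage and the options accepted by `start_spider` are assumptions building on the #136 init-options work, not an existing API:

```elixir
# Hypothetical: read a stored per-site configuration and start an instance
# of the template under a runtime name. MyApp.Repo and MyApp.SpiderConfig
# are illustrative names, not part of Crawly.
config = MyApp.Repo.get_by!(MyApp.SpiderConfig, site: "site_a")

Crawly.Engine.start_spider(MySpiderTemplate,
  name: config.name,
  start_urls: config.start_urls,
  request_xpath: config.request_xpath
)
```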
w.r.t. the demand for this feature, I would say there is high demand for web crawling in general. Should you eventually productize this (I highly recommend that you do), a way to easily create, manage, and deploy hundreds of spiders at scale is definitely something many businesses need. The Elixir community may be small, but Crawly is currently the "go-to" solution for web scraping, judging by the hex downloads compared to `crawler`. |
I'd also be very interested in how to achieve that. I've tried a few things so far, none particularly pretty. What would you say about removing the restriction of one process per module? Then you could instantiate the same crawler with, e.g., a different base URL and run multiple instances. |
Hi @maxlorenz, could you explain your use case? Maybe I'll be able to advise something. Basically, the base URL is only used by the same-domain middleware, so you could run a different version of that middleware in this case. In any case, more info would help! |
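For reference, the domain restriction lives in Crawly's stock middleware stack, so it can be tuned per project (standard Crawly config; exact defaults may differ between versions):

```elixir
# config/config.exs — DomainFilter is the piece that consults the
# spider's base_url/0; dropping or replacing it lifts the same-domain rule.
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ]
```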
Sure. I just want one spider module that fetches the base_url from a data source like the database, and I want to run multiple instances of that spider with different database IDs to fetch the base_url from. |
@maxlorenz I will definitely be heading towards a more dynamic configuration of spiders. Regarding your case, it looks like what you really need is website monitoring that just follows links set in the database. Do I understand your case correctly? |
Exactly. I can already fetch the base URL inside the spider, but I can't start two instances or pass in arguments when the spider is starting. |
Hey @maxlorenz, I think it's already possible to do this with current Crawly. Let me think of an example to publish, as I am constantly thinking about additional content that might make Crawly simpler for the end user. I'll try to produce it as soon as I can, which, looking at my current schedule, should be this week. I'll write to you once I have something to share! |
That would be very helpful! |
@maxlorenz But just in case: are you extracting data from the same website all the time, or from multiple websites? |
What confuses me here is: |
From multiple websites. I am just interested in a few |
I'll further process the data later in my pipeline |
So in general you have a pre-defined set of rules stored in your database; e.g., if I am on siteA, I fetch the rule for siteA, and so on, right? |
Exactly |
Maybe you could list example sites for the article? |
I don't have a list at hand yet. Any site will do; just imagine I want to get the content of all the h1-h5 tags, but I will add more spiders at runtime. |
I found a workaround using Module.create |
@oltarasenko @Ziinc I managed to use `Module.create` to define spiders at runtime. |
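For anyone curious, a minimal sketch of the `Module.create/3` approach; module names, the opts shape, and the Floki-based h1-h5 extraction are illustrative assumptions, not the actual code:

```elixir
# Illustrative: each database row yields a uniquely named spider module
# at runtime by compiling a quoted Crawly.Spider implementation.
defmodule Teddy.SpiderFactory do
  def build(id, base_url) do
    contents =
      quote do
        use Crawly.Spider

        @impl Crawly.Spider
        def base_url(), do: unquote(base_url)

        @impl Crawly.Spider
        def init(), do: [start_urls: [unquote(base_url)]]

        @impl Crawly.Spider
        def parse_item(response) do
          # Grab the contents of all h1-h5 tags, as described above.
          items =
            response.body
            |> Floki.parse_document!()
            |> Floki.find("h1, h2, h3, h4, h5")
            |> Enum.map(&%{heading: Floki.text(&1)})

          %Crawly.ParsedItem{items: items, requests: []}
        end
      end

    module = Module.concat(Teddy.Spiders, "Spider#{id}")
    Module.create(module, contents, Macro.Env.location(__ENV__))
    module
  end
end
```

After `build/2` returns, the new module can be started like any other spider with `Crawly.Engine.start_spider/1`.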
Hey @maxlorenz, yes, it will work. For example, I am doing the same on the CrawlyUI side for now. Regarding the listing function: I think the way to go would be to extend the Engine API so it's possible to register a spider there. |
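The shape of that extension could be as simple as the following, where `register_spider/1` is purely hypothetical and does not exist in Crawly yet:

```elixir
# Hypothetical Engine extension: make a runtime-created module known to
# the engine so it shows up in listings, then start it as usual.
spider = Teddy.SpiderFactory.build(1, "https://example.com")

Crawly.Engine.register_spider(spider)  # proposed, not an existing function
Crawly.Engine.start_spider(spider)
```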
Hi again, @maxlorenz. As promised, here is an article on how to do price monitoring across a list of websites: https://oltarasenko.medium.com/using-elixir-and-crawly-for-price-monitoring-7364d345fc64?sk=9788899eb8e1d1dd6614d022eda350e8 |
Very helpful! I got my project working: I fetch the running spiders here (https://github.com/abotkit/teddy/blob/main/lib/teddy/spiders.ex#L40) and use the list from there. There's lots of room for improvement, and I'd like to explore the S3 export. I'd probably want to buffer the results before writing to S3 to minimize the traffic, as well as to detect when a crawler is done and auto-stop the process (a rough sketch of what I mean is below). But it works for now 🎉 Thanks for the input! |
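A rough sketch of the buffering idea; the bucket, batch size, JSON-lines format, and the ExAws-based upload are assumptions, not the actual teddy code:

```elixir
# Illustrative: accumulate scraped items in a GenServer and flush them to
# S3 in batches, instead of one request per item.
defmodule Teddy.S3Buffer do
  use GenServer

  @flush_at 100

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Called from the item pipeline for every scraped item.
  def push(item), do: GenServer.cast(__MODULE__, {:push, item})

  @impl true
  def init(opts), do: {:ok, %{bucket: Keyword.fetch!(opts, :bucket), items: []}}

  @impl true
  def handle_cast({:push, item}, %{items: items} = state) do
    items = [item | items]

    if length(items) >= @flush_at do
      flush(state.bucket, items)
      {:noreply, %{state | items: []}}
    else
      {:noreply, %{state | items: items}}
    end
  end

  # Upload one batch as a JSON-lines object.
  defp flush(bucket, items) do
    body = items |> Enum.reverse() |> Enum.map(&Jason.encode!/1) |> Enum.join("\n")
    key = "exports/#{System.os_time(:second)}.jl"

    bucket
    |> ExAws.S3.put_object(key, body)
    |> ExAws.request!()
  end
end
```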
I used CrawlyUI as a reference |
@Ziinc, @maxlorenz it's now possible to inspect logs in the UI :). Finally :) |
Hey people,
I want to make a short announcement in the inner circle before moving on to other places. I have just released an alpha version of UI-based spider generators. You can find them here: http://crawlyui.com/spider/new
And here you can find an example of a spider created this way: http://crawlyui.com/spider/new?spider=ErlangSolutionsBlog
Would really appreciate feedback & ideas: @Ziinc