Add crawler to get texts from websites #775
Conversation
Hey @DIVYA-19 , Thanks for working on this! I really think this is a valuable addition to Haystack. I am currently digging into your proposed implementation and a few early questions / comments came up:
No, not exactly. But it's very easy to use, and I'm familiar with it. I had the same thought about the driver path.
Yeah! Creating a module would be nice. I'll do that.
It'll be a good feature, but I have a question: when should we stop looking for URLs?
Sounds good, we can do that.
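The "when to stop" question is usually answered by bounding the crawl with a depth limit and a visited set, and restricting links to the start domain. A minimal sketch of that traversal logic (the function name and the injected `fetch_page` / `extract_links` callables are hypothetical, not part of this PR):

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch_page, extract_links, max_depth=2):
    """Breadth-first crawl bounded by depth. fetch_page and extract_links
    are injected callables so the traversal logic is testable without a network."""
    visited = set()
    pages = []
    frontier = [(start_url, 0)]
    base_domain = urlparse(start_url).netloc
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        html = fetch_page(url)
        pages.append((url, html))
        if depth < max_depth:
            for link in extract_links(html):
                link = urljoin(url, link)
                # stay on the start domain to avoid wandering off into the whole web
                if urlparse(link).netloc == base_domain and link not in visited:
                    frontier.append((link, depth + 1))
    return pages
```

Depth plus a same-domain check is the simplest stopping rule; a URL pattern filter or a page budget are common additions.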
Hi @tholor
Ok, I see. From a quick look, scrapy-splash doesn't look better than Selenium in terms of dependencies as it requires running an external Docker container. I'd say let's go forward with Selenium and let's try to make the configuration + installation of the webdriver as simple as possible for the users.
@DIVYA-19 I saw that you already refactored some parts in the PR. Let me know if I shall do another review round :)
Yeah @tholor! I think it would be nice if you review and give some suggestions. I couldn't update you on the changes in the PR since it was failing some checks.
@DIVYA-19 I just did one review round and pushed a few changes that seemed helpful from my perspective:
Hope this makes sense from your perspective @DIVYA-19 ? @tanaysoni Please review and let me know your thoughts on the integration into pipelines. Especially, I was wondering whether we want to put most of the params rather in the class init or the
Yes @tholor, that totally makes sense to me too.
@tholor is there any plan to include this in the next release? I am trying this with Confluence and will share results.
@lalitpagaria yep, will be merged soon. Just waiting for @oryx1729's review.
I just came here after a nice call with @PiffPaffM, who pointed me to this PR. It's a really nice feature that I also wanted and implemented on my own. I have some questions / ideas about it (from my experimentation, and which could be included later): putting only the text (l.133 & 134 in
Hey @ierezell , I think it would be very helpful to extract more of the structure of a website and use it for better chunking and additional metadata. Re scrapy vs selenium: I see pros and cons for both. Dynamic content was the killer argument for selenium in the current implementation, but as this is not needed in all use cases, scrapy might be a better choice for those. I like your suggestion of having multiple "collectors" that share similar parts further down the pipeline. We could have selenium and scrapy side by side and leave the choice to the end user. Would you be interested in creating a PR for this? I think your code is already in pretty good shape. We'd just need the integration into Haystack and resolving some inconsistencies with the selenium crawler...
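One way to sketch the "selenium and scrapy side by side" idea is a small base class that both backends implement, so everything further down the pipeline sees a single interface. All class names here are hypothetical illustrations, not Haystack API:

```python
from abc import ABC, abstractmethod

class BaseCollector(ABC):
    """Common interface so Selenium- and scrapy-based crawlers are
    interchangeable further down the pipeline (hypothetical sketch)."""

    @abstractmethod
    def collect(self, urls):
        """Return a list of {"text": ..., "meta": {...}} dicts, one per page."""

class SeleniumCollector(BaseCollector):
    """Placeholder for the dynamic-content backend."""
    def collect(self, urls):
        # would drive a headless browser here to render JavaScript-heavy pages
        raise NotImplementedError("requires a webdriver")

class StaticCollector(BaseCollector):
    """Plain-HTTP collector for static pages, standing in for a scrapy-based one."""
    def __init__(self, fetch=None):
        import urllib.request
        # fetch is injectable so the collector can be tested without a network
        self.fetch = fetch or (lambda u: urllib.request.urlopen(u).read().decode())

    def collect(self, urls):
        return [{"text": self.fetch(u), "meta": {"url": u}} for u in urls]
```

With this shape, the end user picks a collector at pipeline-build time and the cleaner / writer stages never need to know which backend produced the documents.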
My apologies for jumping into the discussion. I just want to add the following suggestion -
Hi @tholor, with pleasure! The danger with extracting data is also trying to be a Google (checking videos, images, relations and all the search-engine stuff, which is useful but should be a bit out of focus for now). Otherwise, yes, some text information can be essential and should be extracted.

For scrapy and Selenium: scrapy can be standalone and we can plug Selenium on top of it (and scrapy deals with it automagically). Something like

For collectors, yes, I think that because every website is different and can have subtleties, we could have a basic default parser, but the user could then replace only the parser with their own (which is easy to do, and all the rest like cleaner, writer, etc. is Haystack-made).

I would be really interested but unfortunately I don't think I will have time for this (I should have done another one for deepset as well and I never found the time...). However, I will answer any questions / help if someone implements it.

@lalitpagaria, thanks, I didn't know this library, and indeed it seems to not be evolving anymore (last commit 10 months ago), so we could use this library or implement something similar. It's really nice inspiration for constructing powerful parsers.

Hope you have a great day
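The "basic default parser that the user can replace" idea can be sketched with the standard library alone: a default parser extracts visible text while skipping script/style blocks, and a user can pass a subclass for sites with special structure. Names are illustrative, not from this PR:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Default parser: keep visible text, skip <script>/<style> contents.
    Users with site-specific markup could subclass or replace this."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

def parse_html(html, parser_cls=TextExtractor):
    """Run the given parser class over raw HTML; swapping parser_cls is the
    customization point the comment above suggests."""
    p = parser_cls()
    p.feed(html)
    return p.text()
```

Because only `parser_cls` changes per site, the cleaner and writer stages downstream stay untouched, which matches the "replace only the parser" suggestion.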
Hello
I tried to implement the feature that is discussed in #770.
All required functions are added to preprocessor/utils.py.
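For a feel of what such utility functions might look like, here is a hedged sketch of a whitespace-cleanup helper and a converter to the dict format Haystack document stores accept; both names are illustrative stand-ins, not the actual functions added in preprocessor/utils.py:

```python
import re

def clean_text(raw):
    """Collapse runs of spaces/tabs and drop blank lines - a typical cleanup
    step before writing crawled text to a document store (hypothetical helper)."""
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in raw.splitlines()]
    return "\n".join(ln for ln in lines if ln)

def pages_to_dicts(pages):
    """Convert (url, raw_text) pairs into {"text": ..., "meta": ...} dicts,
    keeping the source URL as metadata for each document."""
    return [{"text": clean_text(text), "meta": {"url": url}} for url, text in pages]
```

Keeping the crawler output in this plain-dict shape means the existing writer stages can consume it without any crawler-specific code.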