possibility to add scraper plugins #198
Comments
Hmm that's definitely a good idea, I'll have to think about it 😄
looking at it, there are two ideas in this proposal: one for custom scrapers and one for dynamic content creation based on stored pages with a specific tag.
I use wallabag, and they piggyback on fivefilters. Maybe one could cook up a parser to read the extraction patterns and feed them to bs4: https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup
Hmm that's really interesting, I'll check it out, thanks for the recommendation!
I'm thinking I could add support for a user-facing python file where the user could define a set of "patterns". When one of these patterns is matched, instead of going with the default method archivy uses, the user's own code would process this special case.
This is on the way! I've just been busy lately :)
great to hear, looking forward to what you implement ;)
What I currently have implemented is set up like this: There's a user-facing

    def fun(data):
        # access url with data.url
        data.title = <>
        data.content = <>
        # modify / fetch whatever you like

    # here you specify the pattern you want to match with a given function
    PATTERNS = {
        "test": fun
    }

Any urls that have
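For anyone following along, here is a rough sketch of how a PATTERNS mapping like the one above might be consulted. This is not archivy's actual code: the `Entry` class, `default_scrape`, `process`, and the choice to treat each key as a regular expression searched against the url are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical stand-in for the `data` object handed to the user's function.
    url: str
    title: str = ""
    content: str = ""


def default_scrape(data):
    # Placeholder for whatever the default extraction would do.
    data.title = data.url
    data.content = "(default extraction)"


def fun(data):
    # User-defined handler: fills in title and content itself.
    data.title = "Custom title"
    data.content = f"Custom content for {data.url}"


# Pattern -> handler, as in the snippet above.
PATTERNS = {"test": fun}


def process(data, patterns=PATTERNS):
    # Assumption: each key is a regex searched against the url;
    # the first matching pattern wins, otherwise fall back to the default.
    for pattern, handler in patterns.items():
        if re.search(pattern, data.url):
            handler(data)
            return data
    default_scrape(data)
    return data
```

With this sketch, `process(Entry("https://example.com/test/page"))` would go through `fun`, while a url without "test" in it would get the default treatment.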
I have to document / test things a bit more, and I'm open to suggestions
@questor is this the type of implementation you were looking for? I'm open to feedback!
See #243
thanks for putting it in, I have seen it but up to now had no time to really test the feature and gather experience with the approach.
This can now be closed.
it would be cool to have custom scrapers for certain domains: for example, if the url is a youtube video, download that video via a custom tool, or if it's a github url, clone the project.
this request is not about the plugins themselves but about the changes needed in archivy itself to filter urls and send them to specific plugins once those have registered themselves as extractors.
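to make the "register as extractor" part a bit more concrete, here is a small sketch of what I have in mind. none of these names exist in archivy; `register_extractor`, `EXTRACTORS` and `route` are made up for illustration, and the handlers only return a description instead of really downloading or cloning anything.

```python
from urllib.parse import urlparse

# Hypothetical registry: domain -> extractor function.
EXTRACTORS = {}


def register_extractor(domain):
    # Decorator a plugin could use to claim all urls of one domain.
    def decorator(func):
        EXTRACTORS[domain] = func
        return func
    return decorator


@register_extractor("youtube.com")
def handle_youtube(url):
    # A real plugin would call a download tool here.
    return f"would download video: {url}"


@register_extractor("github.com")
def handle_github(url):
    # A real plugin would clone the repository here.
    return f"would clone repository: {url}"


def route(url):
    # Look up the url's host (ignoring a leading "www.") in the registry
    # and fall back to the default scraper when no plugin claimed it.
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    handler = EXTRACTORS.get(host)
    if handler is None:
        return f"default scraper: {url}"
    return handler(url)
```

matching on the exact host keeps the lookup trivial; matching on patterns instead of domains would be more flexible but needs a rule for which plugin wins.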
when a plugin automatically adds tags, one idea would be to have generic pages whose content is generated dynamically from these tags. as an example: when using the youtube-downloader (which adds youtube tags), one special page could list all downloaded videos (via a filter on the page that shows all pages tagged "youtube" or "video"). more generically, tag filtering would let you build special dynamic pages: a headline followed by all pages tagged "youtube" and "tutorial", then the next headline followed by the pages tagged "youtube" and "sports", and so on.
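roughly, the dynamic page could be built like this. this is only a sketch: the page dictionaries, `pages_with_tags` and `render_sections` are invented for the example and say nothing about how archivy stores pages or tags.

```python
# Hypothetical stored pages, each with a set of tags.
pages = [
    {"title": "Flask basics", "tags": {"youtube", "tutorial"}},
    {"title": "Cup final highlights", "tags": {"youtube", "sports"}},
    {"title": "Some blog post", "tags": {"python"}},
]


def pages_with_tags(pages, required):
    # Keep only pages that carry every required tag.
    required = set(required)
    return [p for p in pages if required <= p["tags"]]


def render_sections(pages, sections):
    # Each section is (headline, required tags); output is plain text
    # with the headline followed by one line per matching page.
    lines = []
    for headline, tags in sections:
        lines.append(headline)
        for page in pages_with_tags(pages, tags):
            lines.append(f"- {page['title']}")
    return "\n".join(lines)


sections = [
    ("Tutorials", {"youtube", "tutorial"}),
    ("Sports", {"youtube", "sports"}),
]
overview = render_sections(pages, sections)
```

here `overview` contains the "Tutorials" headline with the one tutorial video below it, followed by the "Sports" headline with the sports video; the untagged blog post shows up in neither section.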
sorry if that sounds confusing, it's hard to describe. what do you think about these ideas? I'm not an experienced python coder and don't have much time, but I could try to help you out on this if it fits your vision of the tool.
edit: fixed some spelling...