
possibility to add scraper plugins #198

Closed
questor opened this issue Feb 3, 2021 · 13 comments

@questor

questor commented Feb 3, 2021

It would be cool to have custom scrapers for certain domains: for example, if the URL is a YouTube video, download that video via a custom tool, or if it's a GitHub URL, clone the project.

This request is not about the plugins themselves but about the changes needed in archivy itself to filter URLs and send them to the specific plugins that have registered themselves as extractors.
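To make the dispatch idea concrete, here is a minimal sketch of how per-domain extractors might register themselves. Every name in it (EXTRACTORS, register, dispatch, handle_youtube) is hypothetical and not part of archivy:

from urllib.parse import urlparse

EXTRACTORS = {}  # hypothetical registry: domain -> handler function

def register(domain):
    def wrap(func):
        EXTRACTORS[domain] = func
        return func
    return wrap

@register("www.youtube.com")
def handle_youtube(url):
    # e.g. call a video downloader here and return tags for the new page
    return {"tags": ["youtube", "video"]}

def dispatch(url):
    handler = EXTRACTORS.get(urlparse(url).netloc)
    return handler(url) if handler else None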

When a plugin automatically adds tags, another idea would be generic pages with dynamic content generated from those tags. As an example: when using the youtube-downloader (which adds YouTube tags), one special page could show all downloaded videos (by filtering for pages with the tag "youtube" or "video"). More generally, with tag filtering you could build special dynamic pages: one headline followed by all pages tagged "youtube" and "tutorial", below it the next headline with the tags "youtube" and "sports", and so on (see the sketch below).
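A tiny illustration of those tag-filtered sections; pages and section are made-up names for the sketch, not archivy code:

def section(pages, required_tags):
    # keep only the pages that carry all of the required tags
    return [p for p in pages if required_tags <= set(p["tags"])]

pages = [
    {"title": "Intro to Flask", "tags": ["youtube", "tutorial"]},
    {"title": "Match highlights", "tags": ["youtube", "sports"]},
]
tutorials = section(pages, {"youtube", "tutorial"})  # first headline
sports = section(pages, {"youtube", "sports"})       # second headline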

Sorry if that sounds confusing, it's hard to describe. What do you think about the ideas? I'm not an experienced Python coder and don't have much time, but I could try to help you out on this if it fits your vision of the tool.

edit: fixed some spelling...

@Uzay-G
Member

Uzay-G commented Feb 4, 2021

Hmm that's definitely a good idea, I'll have to think about it 😄

@questor
Author

questor commented Feb 5, 2021

Looking at it, there are two ideas in this proposal: one for custom scrapers and one for dynamic content creation based on stored pages with a specific tag.

@kahnwong

I use wallabag and they piggyback on fivefilters

Maybe we could cook up a parser to read the extraction patterns and feed them to bs4: https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup
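For illustration, a rough sketch of that idea, assuming fivefilters-style "field: xpath" rules; since BeautifulSoup itself has no XPath support (that's what the linked question is about), the sketch uses lxml directly. The rule values here are made up:

import lxml.html

# example rules in the spirit of a fivefilters site config (values made up)
RULES = {
    "title": "//h1",
    "body": "//div[@class='article-body']",
}

def extract(html, rules=RULES):
    tree = lxml.html.fromstring(html)
    # each rule yields the list of elements matching that XPath
    return {field: tree.xpath(xpath) for field, xpath in rules.items()}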

@Uzay-G
Member

Uzay-G commented Feb 17, 2021

Hmm that's really interesting, I'll check it out, thanks for the recommendation!

@Uzay-G
Member

Uzay-G commented Feb 18, 2021

I'm thinking I could add support for a user-facing Python file where the user could define a set of "patterns". When one of these patterns is matched, instead of going with archivy's default method, the user's own code would process that special case.

@Uzay-G Uzay-G pinned this issue Feb 23, 2021
@Uzay-G
Member

Uzay-G commented Jun 10, 2021

This is on the way! I've just been busy lately :)

@questor
Author

questor commented Jun 11, 2021

Great to hear, looking forward to what you implement ;)

@Uzay-G
Member

Uzay-G commented Jun 25, 2021

What I currently have implemented is set up like this:

There's a user-facing scraping.py file where you specify regex patterns like this:

def fun(data):
    # access the url with data.url
    data.title = "..."    # set whatever title you want
    data.content = "..."  # set whatever content you want
    # modify / fetch whatever you like

# here you map each pattern you want to match to a given function
PATTERNS = {
    "test": fun
}

Any URLs that contain "test" will match, and fun will be called on the data instead of the usual behaviour.
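For reference, roughly how that dispatch could work on archivy's side. This is only a sketch, not the actual code: default_scrape is a stand-in, and the import of PATTERNS is an assumption about how the user file gets loaded.

import re

from scraping import PATTERNS  # the user-facing file described above

def default_scrape(data):
    # stand-in for archivy's usual extraction path
    pass

def scrape(data):
    for pattern, handler in PATTERNS.items():
        if re.search(pattern, data.url):
            handler(data)  # user-defined override
            return
    default_scrape(data)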

@Uzay-G
Member

Uzay-G commented Jun 25, 2021

I have to document / test things a bit more, and I'm open to suggestions.

@Uzay-G
Member

Uzay-G commented Jun 27, 2021

@questor is this the type of implementation you were looking for? I'm open to feedback!

@Uzay-G
Member

Uzay-G commented Jul 1, 2021

See #243

@questor
Author

questor commented Jul 2, 2021

Thanks for putting it in. I have seen it, but up to now I haven't had time to really test the feature and gather experience with the approach.

@Uzay-G
Member

Uzay-G commented Aug 17, 2021

This can now be closed.

@Uzay-G Uzay-G closed this as completed Aug 17, 2021
@Uzay-G Uzay-G unpinned this issue Dec 12, 2021