
possibility to add scraper plugins #198

Closed
questor opened this issue Feb 3, 2021 · 13 comments

@questor

questor commented Feb 3, 2021

It would be cool to have custom scrapers for certain domains: for example, if the URL is a YouTube video, download that video via a custom tool, or if it's a GitHub URL, clone the project.

This request is not about the plugins themselves but about the changes needed in archivy itself to filter URLs and send them to the specific plugins that have registered themselves as extractors.
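To make the dispatch idea concrete, here is a minimal sketch of how per-domain extractors might register themselves. Every name in it (EXTRACTORS, register, dispatch, handle_youtube) is hypothetical and not part of archivy:

from urllib.parse import urlparse

EXTRACTORS = {}  # hypothetical registry: domain -> handler function

def register(domain):
    def wrap(func):
        EXTRACTORS[domain] = func
        return func
    return wrap

@register("www.youtube.com")
def handle_youtube(url):
    # e.g. call a video downloader here and return tags for the new page
    return {"tags": ["youtube", "video"]}

def dispatch(url):
    handler = EXTRACTORS.get(urlparse(url).netloc)
    return handler(url) if handler else None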

When a plugin automatically adds tags, another idea would be generic pages with dynamic content generated from those tags. As an example: when using the youtube-downloader (which adds YouTube tags), one special page could show all downloaded videos (by filtering for pages with the tag "youtube" or "video"). More generally, with tag filtering you could build special dynamic pages: one headline followed by all pages tagged "youtube" and "tutorial", below it the next headline with the tags "youtube" and "sports", and so on (see the sketch below).
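A tiny illustration of those tag-filtered sections; pages and section are made-up names for the sketch, not archivy code:

def section(pages, required_tags):
    # keep only the pages that carry all of the required tags
    return [p for p in pages if required_tags <= set(p["tags"])]

pages = [
    {"title": "Intro to Flask", "tags": ["youtube", "tutorial"]},
    {"title": "Match highlights", "tags": ["youtube", "sports"]},
]
tutorials = section(pages, {"youtube", "tutorial"})  # first headline
sports = section(pages, {"youtube", "sports"})       # second headline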

Sorry if that sounds confusing, it's hard to describe. What do you think about the ideas? I'm not an experienced Python coder and don't have much time, but I could try to help you out on this if it fits your vision of the tool.

edit: fixed some spelling...

@Uzay-G
Member

Uzay-G commented Feb 4, 2021

Hmm that's definitely a good idea, I'll have to think about it 😄

@questor
Author

questor commented Feb 5, 2021

Looking at it, there are two ideas in this proposal: one for custom scrapers and one for dynamic content creation based on stored pages with a specific tag.

@kahnwong

I use wallabag and they piggyback on fivefilters

Maybe we could cook up a parser to read the extraction patterns and feed them to bs4: https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup
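For illustration, a rough sketch of that idea, assuming fivefilters-style "field: xpath" rules; since BeautifulSoup itself has no XPath support (that's what the linked question is about), the sketch uses lxml directly. The rule values here are made up:

import lxml.html

# example rules in the spirit of a fivefilters site config (values made up)
RULES = {
    "title": "//h1",
    "body": "//div[@class='article-body']",
}

def extract(html, rules=RULES):
    tree = lxml.html.fromstring(html)
    # each rule yields the list of elements matching that XPath
    return {field: tree.xpath(xpath) for field, xpath in rules.items()}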

@Uzay-G
Member

Uzay-G commented Feb 17, 2021

Hmm that's really interesting, I'll check it out, thanks for the recommendation!

@Uzay-G
Member

Uzay-G commented Feb 18, 2021

I'm thinking I could add support for a user-facing Python file where the user could define a set of "patterns". When one of these patterns is matched, instead of going with archivy's default method, the user's own code would process that special case.

@Uzay-G Uzay-G pinned this issue Feb 23, 2021
@Uzay-G
Member

Uzay-G commented Jun 10, 2021

This is on the way! I've just been busy lately :)

@questor
Author

questor commented Jun 11, 2021

Great to hear, looking forward to what you implement ;)

@Uzay-G
Member

Uzay-G commented Jun 25, 2021

What I currently have implemented is set up like this:

There's a user-facing scraping.py file where you specify regex patterns like this:

def fun(data):
    # access the url with data.url
    data.title = "..."    # set whatever title you want
    data.content = "..."  # set whatever content you want
    # modify / fetch whatever you like

# here you map each pattern you want to match to a given function
PATTERNS = {
    "test": fun
}

Any URLs that contain "test" will match, and fun will be called on the data instead of the usual behaviour.
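For reference, roughly how that dispatch could work on archivy's side. This is only a sketch, not the actual code: default_scrape is a stand-in, and the import of PATTERNS is an assumption about how the user file gets loaded.

import re

from scraping import PATTERNS  # the user-facing file described above

def default_scrape(data):
    # stand-in for archivy's usual extraction path
    pass

def scrape(data):
    for pattern, handler in PATTERNS.items():
        if re.search(pattern, data.url):
            handler(data)  # user-defined override
            return
    default_scrape(data)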

@Uzay-G
Member

Uzay-G commented Jun 25, 2021

I have to document / test things a bit more, and I'm open to suggestions.

@Uzay-G
Member

Uzay-G commented Jun 27, 2021

@questor is this the type of implementation you were looking for? I'm open to feedback!

@Uzay-G
Member

Uzay-G commented Jul 1, 2021

See #243

@questor
Author

questor commented Jul 2, 2021

Thanks for putting it in. I have seen it, but up to now I haven't had time to really test the feature and gather experience with the approach.

@Uzay-G
Member

Uzay-G commented Aug 17, 2021

This can now be closed.

@Uzay-G Uzay-G closed this as completed Aug 17, 2021
@Uzay-G Uzay-G unpinned this issue Dec 12, 2021