This repository has been archived by the owner on Nov 25, 2023. It is now read-only.

White list / black list websites, robots.txt pre-sets #5

Closed
d47081 opened this issue Apr 7, 2023 · 3 comments
Labels
question Further information is requested yggdrasil

Comments

@d47081
Collaborator

d47081 commented Apr 7, 2023

So, trackers with external seeders is shit inside the network

Nice start..

By this subject I mean the websites we need to crawl; some of them may be mirrors that we need to block or limit via crawlPageLimit / CRAWL_HOST_DEFAULT_PAGES_LIMIT.

Some ideas are here, just a few relevant links:
#1 (comment)

And I would like to ask: do we need to enable the GitHub Discussions page, or should Issues be used to resolve problems, not to talk?

@d47081 d47081 added the question Further information is requested label Apr 7, 2023
@ygguser

ygguser commented Apr 7, 2023

And I would like to ask: do we need to enable the GitHub Discussions page, or should Issues be used to resolve problems, not to talk?

Perhaps it would be better to chat and discuss the development in "Discussions", and use this section to solve existing (already implemented :)) problems, as well as consider user requests.
I think it would be more traditional for GitHub.

@d47081 d47081 modified the milestone: A goal Apr 7, 2023
@d47081 d47081 changed the title White list / black list websites [here] White list / black list websites, robots.txt pre-sets [Yggdrasil only] Apr 8, 2023
@d47081 d47081 changed the title White list / black list websites, robots.txt pre-sets [Yggdrasil only] White list / black list websites, robots.txt pre-sets [Yggdrasil] Apr 8, 2023
@d47081 d47081 changed the title White list / black list websites, robots.txt pre-sets [Yggdrasil] White list / black list websites, robots.txt pre-sets Apr 8, 2023
d47081 pushed a commit that referenced this issue Apr 8, 2023
d47081 pushed a commit that referenced this issue Apr 8, 2023
d47081 pushed a commit that referenced this issue Apr 8, 2023
@d47081
Collaborator Author

d47081 commented Apr 8, 2023

Well, for this subject I have implemented a new feature that relates to the hostPage.robotsPostfix field in the database, plus a new configuration option:

/*
 * Permanent rules appended to the host's robots.txt if it exists, otherwise to CRAWL_ROBOTS_DEFAULT_RULES
 * The crawler does not overwrite these rules
 *
 * Presets
 * yggdrasil: /database/yggdrasil/host.robotsPostfix.md
 *
 */
define('CRAWL_ROBOTS_POSTFIX_RULES', null); // string|null

In a few words, we can put extra robots.txt rules into the hostPage.robotsPostfix field, and this data will not be overwritten by the remote robots.txt on auto-update.
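
To illustrate the idea, here is a minimal sketch of how the effective rules could be composed. It is not taken from the YGGo source: the function name composeRobotsRules and its parameters are hypothetical, and it only assumes the behaviour described above (remote robots.txt if present, otherwise the default rules, with the postfix appended at the end).

```php
<?php
// Hypothetical helper, not part of YGGo: composes the rules the crawler would apply.
function composeRobotsRules(?string $remoteRobots, ?string $defaultRules, ?string $postfixRules): string
{
    // Use the remote robots.txt when the host has one, otherwise fall back to the defaults.
    $base = $remoteRobots !== null ? $remoteRobots : (string) $defaultRules;

    // No postfix configured: the base rules are used unchanged.
    if ($postfixRules === null || trim($postfixRules) === '') {
        return $base;
    }

    // No base rules at all: the postfix is the whole rule set.
    if (trim($base) === '') {
        return $postfixRules;
    }

    // The postfix is kept in its own field, so re-fetching robots.txt only replaces $base.
    return rtrim($base) . "\n" . $postfixRules;
}

echo composeRobotsRules("User-agent: *\nDisallow: /tmp", null, "Disallow: /mirror"), "\n";
```

Because the postfix lives in a separate field, an auto-update only has to replace the remote part and recompose the result.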

For white list / black list needs we don't have to implement any new features, because we can simply disable crawling and indexing of a specific domain's pages via the host.status field.

And finally, to close this subject, I have created a database configuration preset to which everyone can contribute proposals.
Because I use this engine for scanning the Yggdrasil network, I have put this registry into a separate folder (the engine could be used for other networks as well):

https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil

@d47081 d47081 closed this as completed Apr 8, 2023
d47081 pushed a commit that referenced this issue Apr 8, 2023
@d47081
Collaborator Author

d47081 commented May 3, 2023

https://github.com/YGGverse/YGGo/tree/main/database/yggdrasil

Just a note: these data sets depend on the crawler configuration, so I have moved these variables to the manifest API, where each application can grab the data that matches its specific requirements.

I am working on a distributed ecosystem, so for now the manifest is advertised like this:
<meta name="yggo" content="/yggo/api.php?action=manifest" />

This option can be enabled by the node owner with the API_ENABLED + API_MANIFEST_ENABLED settings.
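
As a rough sketch of how another application might discover the manifest from that meta tag: the helper below is hypothetical (findYggoManifest is not a YGGo function) and assumes only that a node's page carries the `<meta name="yggo" ...>` element shown above. It uses PHP's bundled DOM extension and does not fetch or interpret the manifest itself.

```php
<?php
// Hypothetical helper, not part of YGGo: returns the manifest path advertised by a page, or null.
function findYggoManifest(string $html): ?string
{
    // DOMDocument::loadHTML rejects empty input, so handle it up front.
    if (trim($html) === '') {
        return null;
    }

    $document = new DOMDocument();

    // Suppress warnings caused by imperfect real-world markup.
    if (!@$document->loadHTML($html)) {
        return null;
    }

    foreach ($document->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'yggo') {
            $content = trim($meta->getAttribute('content'));
            return $content === '' ? null : $content;
        }
    }

    return null;
}

$page = '<html><head><meta name="yggo" content="/yggo/api.php?action=manifest" /></head><body></body></html>';
echo findYggoManifest($page), "\n";
```

A null result would simply mean the node does not advertise a manifest, for example because the API settings are disabled.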
