Facebook blocks all bots via robots.txt - Don't try to snapshot facebook URLs #45

Open
jerclarke opened this issue Jan 11, 2018 · 2 comments

@jerclarke

Okay this is extremely similar to #44 so please read that first.

In this case the answer is even simpler, though: just stop trying to fetch Facebook URLs at all.

They don't work and they will never work as long as we obey Facebook's robots.txt which has the following:

# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php
[..]
User-agent: *
Disallow: /

This seems pretty unambiguous to me, and because our site does a lot of linking to Facebook posts, there are a thousand failed attempts to snapshot their URLs clogging up our queue.
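
To make the rule above concrete, here is a minimal sketch of how a blanket `Disallow: /` for `User-agent: *` could be detected. This is not Amber's actual code: the function name `gv_robots_blocks_everything` is hypothetical, and the parser is deliberately simplified (it ignores `Allow` lines, wildcards, and agent-specific groups other than `*`).

```php
<?php
/**
 * Hypothetical helper (not part of Amber): returns true when a robots.txt
 * body contains a "User-agent: *" group with a "Disallow: /" rule.
 * Simplified: Allow rules and path wildcards are not considered.
 */
function gv_robots_blocks_everything($robots_txt) {
	$applies   = false; // true while inside a group that names "*"
	$in_agents = false; // true while reading consecutive User-agent lines

	foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
		// Strip comments, then skip blank lines and lines without "field: value".
		$line = trim(preg_replace('/#.*$/', '', $line));
		if ($line === '' || strpos($line, ':') === false)
			continue;

		list($field, $value) = array_map('trim', explode(':', $line, 2));
		$field = strtolower($field);

		if ($field === 'user-agent') {
			// A User-agent line after a rule line starts a new group.
			if (!$in_agents) {
				$applies   = false;
				$in_agents = true;
			}
			if ($value === '*')
				$applies = true;
		} else {
			$in_agents = false;
			if ($field === 'disallow' && $value === '/' && $applies)
				return true;
		}
	}

	return false;
}
```

Run against the excerpt quoted above, this returns true, which is why every Facebook snapshot attempt fails for a crawler that obeys robots.txt.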

Can we just reject facebook URLs out of hand and have them skip the queue?

As with Twitter, it would be good to have a WP filter available for sites to quickly add domains that should be ignored completely by Amber.

I feel Facebook should be included in such a list for all plugin users, but the ability to filter them out for our site in particular is the most vital need.

Thanks for your attention and help.

@jerclarke

Okay, so after further testing, we discovered that this is already possible through the Excluded_URL_patterns setting described here:

https://github.com/berkmancenter/amber_wordpress/wiki/Configuration#Excluded_URL_patterns

The problem is that the setting is only visible on the Amber Settings page when the Local backend is selected. When Internet Archive is the backend, it is completely hidden.

While I understand the logic behind this behavior (it is far more crucial when your own servers are storing the results, which might be worthless or redundant for any given domain) I think it should be changed.

All users may want to exclude certain URLs for a variety of reasons, not least of which is the Facebook example above.

AFAICT the exclusion blacklist even works regardless of the storage backend being used, so it's only the settings UI that needs to be changed.

If you don't want to change the settings page to make the blacklist show regardless of backend, I recommend adding a note to the wiki page linked above, clarifying that the box only shows for local storage. As-is it's very confusing (I had in fact read that section of the documentation, but forgot about it because the option wasn't visible on the settings screen, which I did check before creating this ticket).

@jlicht jlicht self-assigned this Mar 16, 2018
@jerclarke

jerclarke commented Jun 28, 2018

UPDATE: We've been running it for a while, but while I'm back here I'll drop in this temporary fix that solves the same problem via WP filters:

/**
 * Filter the return value for "get_option" for the "amber_options" option to add Facebook to the excluded sites.
 *
 * This is necessary since the excluded site list is only displayed when the storage is set
 * to "local" in the settings page. But it still affects whether or not a page will get queued
 * for caching anyways.
 *
 * @see github issue: https://github.com/berkmancenter/amber_wordpress/issues/45
 *
 * @param mixed $amber_options
 *
 * @return mixed
 */
function gv_amber_add_facebook_to_excluded_sites($amber_options) {
	if (!is_array($amber_options)
		|| (!empty($amber_options['amber_excluded_sites']) && false !== stripos($amber_options['amber_excluded_sites'], 'facebook.com')))
		return $amber_options;

	if (!empty($amber_options['amber_excluded_sites']))
		$amber_options['amber_excluded_sites'] .= ',';
	elseif (!isset($amber_options['amber_excluded_sites']))
		$amber_options['amber_excluded_sites'] = '';

	$amber_options['amber_excluded_sites'] .= 'facebook.com';

	return $amber_options;
}
add_filter('option_amber_options', 'gv_amber_add_facebook_to_excluded_sites');

This one specifically solves the problem for Facebook, but it could be adapted to other domains by changing the 'facebook.com' string in both the duplicate check and the line that appends it.

Most of the logic is just to ensure there are no PHP notices and that the commas are always where they need to be.
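
For sites that want to exclude several domains, the same idea can be generalized. The following is a rough sketch, not a tested drop-in: the function names `gv_amber_merge_excluded_sites` and `gv_amber_exclude_domains` and the domain list are hypothetical, and it assumes (as the snippet above does) that `amber_excluded_sites` is a comma-separated string inside the `amber_options` option.

```php
<?php
/**
 * Hypothetical helper: append each domain to a comma-separated exclusion
 * string, skipping any domain that already appears in an existing entry.
 */
function gv_amber_merge_excluded_sites($existing, array $domains) {
	// Split the stored string into trimmed, non-empty entries.
	$parts = array_filter(array_map('trim', explode(',', (string) $existing)), 'strlen');

	foreach ($domains as $domain) {
		$found = false;
		foreach ($parts as $part) {
			if (false !== stripos($part, $domain)) {
				$found = true;
				break;
			}
		}
		if (!$found)
			$parts[] = $domain;
	}

	return implode(',', $parts);
}

/**
 * Filter callback for the "amber_options" option (assumed option shape).
 */
function gv_amber_exclude_domains($amber_options) {
	if (!is_array($amber_options))
		return $amber_options;

	$existing = isset($amber_options['amber_excluded_sites'])
		? $amber_options['amber_excluded_sites']
		: '';

	// Example list only; edit to suit your site.
	$amber_options['amber_excluded_sites'] = gv_amber_merge_excluded_sites(
		$existing,
		array('facebook.com', 'twitter.com')
	);

	return $amber_options;
}

// Register only when running inside WordPress.
if (function_exists('add_filter'))
	add_filter('option_amber_options', 'gv_amber_exclude_domains');
```

Keeping the string handling in a separate function means the comma logic can be checked outside WordPress, and the list of domains lives in one place.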
