Received HTTP code 403 when trying to fetch a site using Cloudflare #316

Closed
clementbiron opened this issue Aug 16, 2021 · 11 comments
@clementbiron
Member

clementbiron commented Aug 16, 2021

I am trying to add the Roblox service and its documents with the following declaration:

{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"]
    },
    "Terms of Service": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use",
      "select": [".article"],
      "remove": [".article-relatives", ".article-footer"]
    },
    "Community Guidelines": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules",
      "select": [".article"],
      "remove": [".article-footer", ".article-relatives"]
    }
  }
}

I get these Node error messages:

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules'

@clementbiron
Member Author

I get the same error when trying to add the Coinbase documents with the following declaration:

{
  "name": "Coinbase",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.coinbase.com/legal/privacy",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Trackers Policy": {
      "fetch": "https://www.coinbase.com/legal/cookie",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Terms of Service": {
      "fetch": "https://www.coinbase.com/legal/user_agreement/ireland_europe",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    }
  }
}

Content inacessible: Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.coinbase.com/legal/user_agreement/ireland_europe'

@martinratinaud changed the title from "Received HTTP code 403 when trying to fetch" to "Received HTTP code 403 when trying to fetch a site using Cloudflare" on Aug 26, 2021
@martinratinaud
Member

This is mainly because those sites use a service like Cloudflare to filter their traffic.

Our scraping attempt is identified as bot traffic and is therefore blocked with a 403.

I tried all of the following, with no success:

  • changing the user agent
  • using proxies from free-list-proxies
  • adding a referrer
  • adding a referrer policy
  • using cloudflare-bypasser from GitHub

So I suggest for now that you use "executeClientScripts"
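
As an illustration of that suggestion, here is a minimal sketch of the Roblox Privacy Policy declaration from the top of this issue with the option turned on. It assumes "executeClientScripts" is a boolean set on each document, next to "fetch" and "select"; that placement is not confirmed anywhere in this thread, and nothing here guarantees that it gets past the 403.

```json
{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"],
      "executeClientScripts": true
    }
  }
}
```

The same key would presumably be added to each of the other documents that return a 403.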

In the meantime, I've sent a support ticket to Cloudflare through my personal premium account. Let's see what they say:

Hi, My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs.  

We are running the OpenSource project "Open Terms Archive" which aims at tracking ToS for every 
service in the world, in all languages and all countries.  
As such, we are implementing a crawler that tracks changes on ToS regularly.  
We know we are currently blocked by your services and would like our bot to be trusted 
by Cloudflare as a good bot (whitelisted) so that we are not blocked anymore 

Thanks a lot

Check our websites here: 
https://www.opentermsarchive.org/en 
https://disinfo.quaidorsay.fr/en

@martinratinaud
Member

And here is Cloudflare's response:

Hi there,

Thanks for contacting Cloudflare support. My name is Yuri and I will be looking into this ticket for you.

To add a bot to Cloudflare's allowlist, please submit this online application.

For more information, please see: Frequently asked questions about Cloudflare bot products

Please let us know if you have any further questions or issues.

Yuri | Cloudflare Support
Search the Cloudflare Community for advice and insight.

Online application: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform
FAQ: https://support.cloudflare.com/hc/en-us/articles/360035387431-Frequently-asked-questions-about-Cloudflare-bot-products?source=search

@trujilloelsa @clementbiron @MattiSG I believe we should apply. What about you?

@clementbiron
Member Author

Yes ✔️

@martinratinaud
Member

I have just submitted the bot verification application.

(Screenshot of the submitted Google Forms application)

Waiting for their answer.

@martinratinaud
Member

As we have not received any answer in 40 days, I created a new topic on the Cloudflare Community forum:

https://community.cloudflare.com/t/cloudflare-bot-verification-submitted-but-no-answer/320260

@clementbiron
Member Author

I'm not sure this is Cloudflare protection, but when running "npm start Galeries Lafayette" I get:

2022-02-22 16:19:18 warn  Galeries Lafayette — Privacy Policy                     The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/service-confidence'
2022-02-22 16:19:18 warn  Galeries Lafayette — Terms of Service                   The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/conditions-generals'

with the following declaration:

{
  "name": "Galeries Lafayette",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.galerieslafayette.com/service/service-confidence",
      "select": [".mainContent"]
    },
    "Terms of Service": {
      "fetch": "https://www.galerieslafayette.com/service/conditions-generals",
      "select": [".mainContent"]
    }
  }
}

@clementbiron
Member Author

Same for this declaration:

{
  "name": "GO Sport",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.go-sport.com/cgv/",
      "select": ["#content"]
    },
    "Privacy Policy": {
      "fetch": "https://www.go-sport.com/charte-protection-donnees-clients/",
      "select": ["#content"]
    }
  }
}

@clementbiron
Member Author

Same for this declaration OpenTermsArchive/france-declarations@a0e6b46

@clementbiron
Member Author

clementbiron commented Mar 22, 2022

I'm not sure it's about Cloudflare protection, but the following declarations return a 403 error:

@MattiSG
Member

MattiSG commented Apr 24, 2023

We do not actively work on #166 at the moment. We will reopen it when we prioritise this work again. In the meantime, feel free to add any additional relevant information specific to Cloudflare to this issue.

@MattiSG closed this as completed on Apr 24, 2023