From fd5722f18e30805d948cd784b2b94fc575ae6985 Mon Sep 17 00:00:00 2001
From: Fawzi Mohamed
Date: Tue, 19 Aug 2025 16:34:40 +0200
Subject: [PATCH 1/5] Adding guidelines on accessing external services

Completing the internet access guide with the guidelines to access external services (bulk downloads, web scraping,...)
---
 docs/guides/internet-access.md | 42 ++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/docs/guides/internet-access.md b/docs/guides/internet-access.md
index e77b6393..d16d3197 100644
--- a/docs/guides/internet-access.md
+++ b/docs/guides/internet-access.md
@@ -9,6 +9,7 @@ Login nodes have public IP addresses which means that they can directly access t
     Many services will rate limit or block usage based on the IP address if abused.
     An example is pulling container images from Docker Hub.
     [Authenticating with Docker Hub][ref-ce-third-party-private-registries] makes their rate limit apply per user instead.
+    See also the [guidelines below][ref-guides-internet-access-ext]
 
 ## Accessing the public IP of a node
 
@@ -18,3 +19,44 @@ When on a login node configured with a public IP address, you can retrieve the p
 $ curl api.ipify.org
 148.187.6.19
 ```
+
+[](){#ref-guides-internet-access-ext}
+## Guidelines on communicating with external services (web scraping, bulk downloads,…)
+
+Alps is a an excellent machine to simulate, evaluate and analyze data, and communication within Alps is optimized. Communication with external services is often needed to set up a calculation or communicate results to others.
+
+To enable this CSCS has excellent connection (400Gbit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
+
+Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **you should not** put load on services (like websites) that do not expect it, for example through **scraping**.
+
+### Shared resources
+
+If you need to heavily interact with external systems there are some caveats that you have to keep in mind, in general some resources are shared resources, and a single user should not monopolize their use.
+
+To avoid abuse there are measures in place at CSCS, on the transit networks, and on the remote systems, but these measures are often very blunt and would affect the CSCS as whole, so care should be taken to avoid triggering them.
+We have a good relationship with Switch, so if we trigger some of their failsafes (for example their anti-DDOS tools), they will contact us. Other might take action without telling us anything.
+
+For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too. Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected. In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
+As far as we know, they don't publish the levels of the number of requests/queries per second that can trigger this kind of action, for some obvious reason that bad-intentioned people would stay just below this limit...
+
+So you should be mindful of your usage, in particular of the number of requests to the DNS and the network bandwidth.
+Every access to a different domain will trigger a DNS request, using multiple nodes does not solve the problem, because they will still be hitting the same DNS resolver.
+
+We do have protection in place for our public DNS server, but other DNS servers might decide to blacklist the originator of all those requests.
+On Alps currently we use an internal DNS, which is also used to resolve the different nodes in alps, and does not have special protections against abuse. For this reason **avoid scraping from Alps**, as it could lead to it being blacklisted.
+Resources outside Alps can do 100s of request per second without problems.
+
+Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider (as google for example) with SWITCH, thus affecting almost any any user trying to query google from Switzerland.
+
+### Conclusions
+Before any large scale sustained use of external resources think carefully about the load you are putting on the CSCS, network and target, both in number of requests and size of the request.
+
+Try to change the perspective: how quickly do you really need the whole data? Can you or should you use resources outside Alps, or even outside CSCS? Maybe geo-distributed?
+
+Also reach out to us, so that we are aware of what you are doing, and react quickly if we reach out to you. This last part worked well until now, and it is important that it continues to work well.
+
+Even if you did your homework and calculated that your load is acceptable it is important to understand that at the end it's the aggregated load across all users that counts, and if suddenly many users add an "acceptable" load it might not be so acceptable after all.
+
+Finally here we do not touch the legal aspect of the data collection which we expect you to clear separately: copyright/licensing issues, and storage of data that might contain private information, and consequently needs to be handled with due diligence to avoid data breaches.
+
+We want to support your ground breaking research, let's work together to find an acceptable solution for everybody, in the end being ethical is also about this.

From a0426eb36e70cfe798c150f1a28932c64da702b4 Mon Sep 17 00:00:00 2001
From: Fawzi Mohamed
Date: Tue, 19 Aug 2025 17:15:19 +0200
Subject: [PATCH 2/5] Fix spelling

---
 docs/guides/internet-access.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/guides/internet-access.md b/docs/guides/internet-access.md
index d16d3197..05c544ae 100644
--- a/docs/guides/internet-access.md
+++ b/docs/guides/internet-access.md
@@ -25,16 +25,16 @@ $ curl api.ipify.org
 
 Alps is a an excellent machine to simulate, evaluate and analyze data, and communication within Alps is optimized. Communication with external services is often needed to set up a calculation or communicate results to others.
 
-To enable this CSCS has excellent connection (400Gbit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
+To enable this CSCS has excellent connection (400GBit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
 
-Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **you should not** put load on services (like websites) that do not expect it, for example through **scraping**.
+Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **you should not** put load on services (web sites,...) that do not expect it, for example through **scraping**.
 
 ### Shared resources
 
 If you need to heavily interact with external systems there are some caveats that you have to keep in mind, in general some resources are shared resources, and a single user should not monopolize their use.
 
 To avoid abuse there are measures in place at CSCS, on the transit networks, and on the remote systems, but these measures are often very blunt and would affect the CSCS as whole, so care should be taken to avoid triggering them.
-We have a good relationship with Switch, so if we trigger some of their failsafes (for example their anti-DDOS tools), they will contact us. Other might take action without telling us anything.
+We have a good relationship with Switch, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us. Other might take action without telling us anything.
 
 For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too. Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected. In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
 As far as we know, they don't publish the levels of the number of requests/queries per second that can trigger this kind of action, for some obvious reason that bad-intentioned people would stay just below this limit...
@@ -46,7 +46,7 @@ We do have protection in place for our public DNS server, but other DNS servers mi
 On Alps currently we use an internal DNS, which is also used to resolve the different nodes in alps, and does not have special protections against abuse. For this reason **avoid scraping from Alps**, as it could lead to it being blacklisted.
 Resources outside Alps can do 100s of request per second without problems.
 
-Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider (as google for example) with SWITCH, thus affecting almost any any user trying to query google from Switzerland.
+Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider (as Google for example) with SWITCH, thus affecting almost any any user trying to query Google from Switzerland.
 
 ### Conclusions
 Before any large scale sustained use of external resources think carefully about the load you are putting on the CSCS, network and target, both in number of requests and size of the request.

From b112aba0f5d61882fcdd5af26fae4ff839e177e5 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Wed, 20 Aug 2025 08:54:00 +0200
Subject: [PATCH 3/5] add GBit to spelling; clean up text

---
 .github/actions/spelling/allow.txt | 1 +
 docs/guides/internet-access.md | 22 ++++++++++++++--------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/.github/actions/spelling/allow.txt b/.github/actions/spelling/allow.txt
index 62cc8410..e96d734a 100644
--- a/.github/actions/spelling/allow.txt
+++ b/.github/actions/spelling/allow.txt
@@ -28,6 +28,7 @@ Fawzi
 Fock
 Foket
 GAPW
+GBit
 GGA
 GPFS
 GPG
diff --git a/docs/guides/internet-access.md b/docs/guides/internet-access.md
index 05c544ae..290bd9b0 100644
--- a/docs/guides/internet-access.md
+++ b/docs/guides/internet-access.md
@@ -25,9 +25,9 @@ $ curl api.ipify.org
 
 Alps is a an excellent machine to simulate, evaluate and analyze data, and communication within Alps is optimized. Communication with external services is often needed to set up a calculation or communicate results to others.
 
-To enable this CSCS has excellent connection (400GBit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
+To enable this CSCS has excellent connection (400 GBit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
 
-Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **you should not** put load on services (web sites,...) that do not expect it, for example through **scraping**.
+Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **do not** put load on services that do not expect it, for example through **scraping**.
 
 ### Shared resources
 
@@ -36,19 +36,25 @@ If you need to heavily interact with external systems there are some caveats tha
 To avoid abuse there are measures in place at CSCS, on the transit networks, and on the remote systems, but these measures are often very blunt and would affect the CSCS as whole, so care should be taken to avoid triggering them.
 We have a good relationship with Switch, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us. Other might take action without telling us anything.
 
-For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too. Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected. In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
-As far as we know, they don't publish the levels of the number of requests/queries per second that can trigger this kind of action, for some obvious reason that bad-intentioned people would stay just below this limit...
+For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too.
+Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected.
+In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
+
+!!! info
+    Sites do not publish the number of requests/queries per second that trigger blacklisting, for some obvious reason that bad-intentioned people would stay just below this limit.
 
 So you should be mindful of your usage, in particular of the number of requests to the DNS and the network bandwidth.
 Every access to a different domain will trigger a DNS request, using multiple nodes does not solve the problem, because they will still be hitting the same DNS resolver.
 
-We do have protection in place for our public DNS server, but other DNS servers might decide to blacklist the originator of all those requests.
-On Alps currently we use an internal DNS, which is also used to resolve the different nodes in alps, and does not have special protections against abuse. For this reason **avoid scraping from Alps**, as it could lead to it being blacklisted.
-Resources outside Alps can do 100s of request per second without problems.
+CSCS has protection in place for our public DNS server, but other DNS servers might decide to blacklist the originator of all those requests.
+Alps uses an internal DNS, which is also used to resolve the different nodes in alps, and does not have special protections against abuse.
+For this reason **avoid scraping from Alps**, as it could lead to it being blacklisted.
 
-Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider (as Google for example) with SWITCH, thus affecting almost any any user trying to query Google from Switzerland.
+!!! info
+    Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider like Google, which would affect all Swiss Google users.
 
 ### Conclusions
+
 Before any large scale sustained use of external resources think carefully about the load you are putting on the CSCS, network and target, both in number of requests and size of the request.
 
 Try to change the perspective: how quickly do you really need the whole data? Can you or should you use resources outside Alps, or even outside CSCS? Maybe geo-distributed?

From 1d56fdf71480c2c5876b9f000ca2e5e8c07fd122 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Wed, 20 Aug 2025 09:02:22 +0200
Subject: [PATCH 4/5] further clean up

---
 docs/guides/internet-access.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/docs/guides/internet-access.md b/docs/guides/internet-access.md
index 290bd9b0..64ff45e9 100644
--- a/docs/guides/internet-access.md
+++ b/docs/guides/internet-access.md
@@ -23,25 +23,24 @@ $ curl api.ipify.org
 [](){#ref-guides-internet-access-ext}
 ## Guidelines on communicating with external services (web scraping, bulk downloads,…)
 
-Alps is a an excellent machine to simulate, evaluate and analyze data, and communication within Alps is optimized. Communication with external services is often needed to set up a calculation or communicate results to others.
+Communication with external services from Alps is provided by a high-capacity 400 GBit/s connection to [SWITCH](https://www.switch.ch/en/network/ip-access).
+SWITCH provides internet services to the research and education infrastructure in Switzerland.
 
-To enable this CSCS has excellent connection (400 GBit/s) to SWITCH.ch, that provides internet services to the research and education infrastructure in Switzerland.
-
-Still communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **do not** put load on services that do not expect it, for example through **scraping**.
+However, communication with external services is not the focus of CSCS, it is rather seen as a way to enable the use of our resources, so for example as explained below from Alps **do not** put load on services that do not expect it, for example through **scraping**.
 
 ### Shared resources
 
 If you need to heavily interact with external systems there are some caveats that you have to keep in mind, in general some resources are shared resources, and a single user should not monopolize their use.
 
 To avoid abuse there are measures in place at CSCS, on the transit networks, and on the remote systems, but these measures are often very blunt and would affect the CSCS as whole, so care should be taken to avoid triggering them.
-We have a good relationship with Switch, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us. Other might take action without telling us anything.
+We have a good relationship with SWITCH, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us. Other might take action without telling us anything.
 
 For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too.
 Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected.
 In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
 
 !!! info
-    Sites do not publish the number of requests/queries per second that trigger blacklisting, for some obvious reason that bad-intentioned people would stay just below this limit.
+    Sites do not publish the number of requests/queries per second that trigger blacklisting, for some obvious reason that bad-intentioned people would stay just below this limit.
 
 So you should be mindful of your usage, in particular of the number of requests to the DNS and the network bandwidth.
 Every access to a different domain will trigger a DNS request, using multiple nodes does not solve the problem, because they will still be hitting the same DNS resolver.
@@ -50,8 +49,8 @@ CSCS has protection in place for our public DNS server, but other DNS servers mi
 Alps uses an internal DNS, which is also used to resolve the different nodes in alps, and does not have special protections against abuse.
 For this reason **avoid scraping from Alps**, as it could lead to it being blacklisted.
 
-!!! info
-    Given the excellent connection of the CSCS network with SWITCH a sustained use of it can saturate the connection of a large provider like Google, which would affect all Swiss Google users.
+!!! warning
+    The high-capacity of the CSCS-SWITCH connection can saturate the connection of a large provider like Google, which would affect all Swiss Google users.
 
 ### Conclusions
 

From 171384d5e1c9fb63152130b118c622a609a2a51f Mon Sep 17 00:00:00 2001
From: bcumming
Date: Wed, 20 Aug 2025 09:11:57 +0200
Subject: [PATCH 5/5] further clean up

---
 docs/guides/internet-access.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/docs/guides/internet-access.md b/docs/guides/internet-access.md
index 64ff45e9..be94a68c 100644
--- a/docs/guides/internet-access.md
+++ b/docs/guides/internet-access.md
@@ -21,7 +21,14 @@ $ curl api.ipify.org
 ```
 
 [](){#ref-guides-internet-access-ext}
-## Guidelines on communicating with external services (web scraping, bulk downloads,…)
+## Communicating with external services
+
+!!! note
+    Examples of the type of external communication that can trigger problems include:
+
+    * web scraping;
+    * bulk downloads;
+    * pipelines constantly pulling the same image from DockerHub.
 
 Communication with external services from Alps is provided by a high-capacity 400 GBit/s connection to [SWITCH](https://www.switch.ch/en/network/ip-access).
 SWITCH provides internet services to the research and education infrastructure in Switzerland.
@@ -33,11 +40,12 @@ However, communication with external services is not the focus of CSCS, it is ra
 If you need to heavily interact with external systems there are some caveats that you have to keep in mind, in general some resources are shared resources, and a single user should not monopolize their use.
 
 To avoid abuse there are measures in place at CSCS, on the transit networks, and on the remote systems, but these measures are often very blunt and would affect the CSCS as whole, so care should be taken to avoid triggering them.
-We have a good relationship with SWITCH, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us. Other might take action without telling us anything.
+We have a good relationship with SWITCH, so if we trigger some of their fail-safes (for example their anti-DDoS tools), they will contact us.
+External providers might take action, like blacklisting Alps, without warning or notification.
 
-For example a website might blacklist IPs, or whole subnets from CSCS, which would make the service unavailable for all other CSCS users too.
-Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist the CSCS many users will be affected.
-In addition, once we are blacklisted, it's extremely difficult and long be able to get out of these blacklists.
+For example a website might blacklist IPs, or whole subnets from CSCS, rendering the service unavailable for **all CSCS users**.
+Many sites use content delivery networks (CDN), like Cloudflare, Akamai, or similar, and if those blacklist CSCS we would lose access to all content provided by those CDNs.
+In addition, once blacklisted, it is very difficult to get removed from the blacklist.
 
 !!! info
     Sites do not publish the number of requests/queries per second that trigger blacklisting, for some obvious reason that bad-intentioned people would stay just below this limit.
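
The guide added by this series asks users to be mindful of how many requests they send to external services. One way to act on that is a small client-side throttle, sketched below in Python. Everything in it is an assumption made for illustration: the class name `MinIntervalThrottle`, the one-second default interval, and the injectable `clock` and `sleep` parameters are hypothetical and are not provided by CSCS or taken from the guide, which deliberately states no safe request rate.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval_s: float = 1.0,  # assumed value; the guide gives no threshold
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if min_interval_s < 0:
            raise ValueError("min_interval_s must be non-negative")
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which the next request may start; None before first use.
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None:
            delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
            # Re-read the clock in case the sleep overshot.
            now = self._clock()
        self._next_allowed = now + self.min_interval_s
        return delay
```

A caller would invoke `throttle.wait()` immediately before each outbound request. The throttle only limits the process that owns it, so it does not coordinate several nodes; as the guide notes, spreading work across nodes still hits the same DNS resolver, so the combined rate of all workers is what matters.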