### Debug Collector Error

We have issues that come from being unable to collect data from a URL. This notebook gives some quickly configurable code, using the problematic URL, to replicate the error and also run a simple request to check whether it is caused by our user agent.

In [2]:
%load_ext autoreload
%autoreload 2

from digital_land.collect import Collector
from pathlib import Path
import requests

In [3]:
endpoint_url = "https://www.worcester.gov.uk/files/PDF%20Documents/Planning/planning_policy/brownfieldsites/Brownfield%20Land%20Register%202025.csv"
plugin = ''

data_dir = Path('./data/debug_collector_error')
data_dir.mkdir(parents=True, exist_ok=True)

In [6]:

collection_dir = data_dir / 'collection'
collection_dir.mkdir(parents=True, exist_ok=True)
resource_dir = collection_dir / 'resource'

collector = Collector(resource_dir=resource_dir)
# Reuse the plugin configured above rather than hard-coding it again
collector.fetch(
    endpoint_url,
    plugin=plugin,
    refill_todays_logs=True,
)

(<FetchStatus.FAILED: 5>,
 {'endpoint-url': 'https://www.worcester.gov.uk/files/PDF%20Documents/Planning/planning_policy/brownfieldsites/Brownfield%20Land%20Register%202025.csv',
  'entry-date': '2026-01-06T10:16:00.975030',
  'ssl-verify': True,
  'status': '307',
  'request-headers': {'User-Agent': 'MHCLG Planning Data Collector',
   'Accept-Encoding': 'gzip, deflate',
   'Accept': '*/*',
   'Connection': 'keep-alive'},
  'response-headers': {'Date': 'Tue, 06 Jan 2026 10:17:44 GMT',
   'Content-Type': 'text/html',
   'Transfer-Encoding': 'chunked',
   'Connection': 'keep-alive',
   'X-Sucuri-ID': '21007',
   'X-XSS-Protection': '1; mode=block',
   'X-Frame-Options': 'SAMEORIGIN',
   'X-Content-Type-Options': 'nosniff',
   'Content-Security-Policy': 'upgrade-insecure-requests;',
   'Server': 'Sucuri/Cloudproxy',
   'Alt-Svc': 'h3=":443"; ma=2592000, h3-29=":443"; ma=2592000'},
  'elapsed': '0.091'})

If an error appears above, use the code below to make HEAD requests that check for the common causes:

* User Agent Error - the request only succeeds when we don't include our user agent. This is common because web hosting providers often block recurring user agents.
* SSL Error - this happens when we can't verify the SSL certificate.

For both of these errors, action needs to be taken on the provider's end, as we don't have control over these problems. If it is neither of these errors, it tends to imply that our code is adding complexity that breaks the request, and you can review our code.
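
The decision between these causes can be written down as a small pure function, which makes the logic easy to check without touching the network. This is only a sketch: `classify_fetch_failure` and its labels are hypothetical and not part of `digital_land`, and a status of `None` is assumed to mean the request was not made or did not complete.

```python
from typing import Optional


def classify_fetch_failure(
    status_with_ua: Optional[int],
    status_without_ua: Optional[int],
    ssl_failed: bool,
) -> str:
    """Map the outcomes of the diagnostic requests to a likely cause.

    Hypothetical helper for illustration only.
    """
    # An SSL verification failure is reported before anything else
    if ssl_failed:
        return "ssl-error"
    # The request works with our user agent, so there is nothing to diagnose
    if status_with_ua == 200:
        return "ok"
    # Only the request without our user agent works
    if status_without_ua == 200:
        return "user-agent-error"
    # Neither common cause applies, so further investigation is needed
    return "unknown"


print(classify_fetch_failure(403, 200, ssl_failed=False))
print(classify_fetch_failure(307, 307, ssl_failed=False))
```

The second call mirrors a case where the status is the same with and without our user agent, which falls outside both common causes.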

In [7]:
try:
    # Default is verify=True, but you can be explicit
    response_original = requests.head(
        endpoint_url, 
        timeout=5, 
        verify=True, 
        headers={"User-Agent": collector.user_agent},
        allow_redirects=True
    )
    if response_original.status_code == 200:
        print(f"{endpoint_url} is OK (200). SSL verified.")
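    else:
        # Retry without our User-Agent to see whether the host is rejecting it
        response_no_user_agent = requests.head(
            endpoint_url, timeout=5, verify=True, allow_redirects=True
        )
        print(f"{endpoint_url} returned status {response_no_user_agent.status_code}. SSL verified.")
        if response_no_user_agent.status_code == 200:
            print("The request succeeds without our User-Agent, so the host is likely blocking it.")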
except requests.exceptions.SSLError as ssl_err:
    print(f"SSL verification failed for {endpoint_url}: {ssl_err}")
    try:
        response_no_verify = requests.head(
            endpoint_url, 
            timeout=5, 
            verify=False, 
            headers={"User-Agent": collector.user_agent},
            allow_redirects=True
        )
        if response_no_verify.status_code == 200:
            print(f"{endpoint_url} is OK (200) when SSL verification is disabled.")
        else:
            print(f"{endpoint_url} returned status {response_no_verify.status_code} when SSL verification is disabled.")
    except requests.RequestException as e:
        print(f"Error checking {endpoint_url} with SSL verification disabled: {e}")
except requests.RequestException as e:
    print(f"Error checking {endpoint_url}: {e}")


https://www.worcester.gov.uk/files/PDF%20Documents/Planning/planning_policy/brownfieldsites/Brownfield%20Land%20Register%202025.csv returned status 307. SSL verified.
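
In the collector log above, the 307 comes back with `Server: Sucuri/Cloudproxy` and an `X-Sucuri-ID` header, which suggests the response may have been generated by a proxy in front of the site rather than by the site itself. A minimal sketch of a check for that case follows; `looks_like_proxy_challenge` is a hypothetical helper, and the header heuristic is an assumption based only on this one response.

```python
def looks_like_proxy_challenge(status_code: int, headers: dict) -> bool:
    """Heuristic: a non-200 response that carries Sucuri proxy headers.

    Hypothetical helper for illustration only.
    """
    # Normalise header names so plain dicts and requests' headers both work
    lowered = {key.lower(): str(value) for key, value in headers.items()}
    served_by_sucuri = (
        "sucuri" in lowered.get("server", "").lower()
        or "x-sucuri-id" in lowered
    )
    return status_code != 200 and served_by_sucuri


# Header values copied from the collector log above
logged_headers = {
    "Content-Type": "text/html",
    "X-Sucuri-ID": "21007",
    "Server": "Sucuri/Cloudproxy",
}
print(looks_like_proxy_challenge(307, logged_headers))
```

With a live response, the same check could be called as `looks_like_proxy_challenge(response.status_code, response.headers)`.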


A generic download of the URL without any of our setup. Useful for further debugging and for inspecting the content.

In [10]:
response = requests.get(
        endpoint_url, 
        timeout=5, 
        verify=True, 
        headers={"User-Agent": collector.user_agent},
        allow_redirects=True
    )
print(f"GET request status code: {response.status_code}")
print(response.content)

GET request status code: 307
b"\t        <html><title>You are being redirected...</title>\n\t\t<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>\n\t\t<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='bz0iMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMSkgKyAiZiIgKyAnOScgKyAnYycgKyAnZicgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDU3KSArICdiJyArICJhIiArICcyJyArICJhIiArICczJyArIFN0cmluZy5mcm9tQ2hhckNvZGUoMTAyKSArIFN0cmluZy5mcm9tQ2hhckNvZGUoMTAyKSArICJlIiArIFN0cmluZy5mcm9tQ2hhckNvZGUoNTcpICsgJzgnICsgJzUnICsgIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDApICsgJzQnICsgJ2InICsgIjgiICsgIjciICsgIjciICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDEpICsgJzAnICsgJ2InICsgU3RyaW5nLmZyb21DaGFyQ29kZSg5OCkgKyAnZicgKyAiMSIgKyAnOScgKyAnJztkb2N1bWVudC5jb29raWU9J3MnKyd1JysnYycrJ3UnKydyJysnaScrJ18nKydjJysnbCcrJ28nKyd1JysnZCcrJ3AnKydyJysnbycrJ3gnKyd5JysnXycrJ3UnKyd1JysnaScrJ2QnKydfJysnZCcrJzYnKydjJysnZScrJzQnKydjJysnOCcrJ2QnKycwJysiPSIgKyBvICsgJztwYXRoPS87