urllib.error.HTTPError: HTTP Error 404: Not Found #549

Open
Macodemia opened this issue Feb 29, 2024 · 0 comments

Hello,

I have cloned the project and created a config file using the corresponding Python script.
Scraping works perfectly fine with wg-gesucht and immowelt.
However, when scraping kleinanzeigen, I receive the error:

urllib.error.HTTPError: HTTP Error 404: Not Found

Below is the full stack trace:

/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/bin/python /Users/jack/Development/flathunter/flathunt.py 
[2024/02/29 21:04:21|config.py               |INFO    ]: Using config path /Users/jack/Development/flathunter/config.yaml
[2024/02/29 21:04:21|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
Traceback (most recent call last):
  File "/Users/jack/Development/flathunter/flathunt.py", line 99, in <module>
    main()
  File "/Users/jack/Development/flathunter/flathunt.py", line 95, in main
    launch_flat_hunt(config, heartbeat)
  File "/Users/jack/Development/flathunter/flathunt.py", line 35, in launch_flat_hunt
    hunter.hunt_flats()
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 56, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 35, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/hunter.py", line 27, in try_crawl
    return searcher.crawl(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 151, in crawl
    return self.get_results(url, max_pages)
  File "/Users/jack/Development/flathunter/flathunter/abstract_crawler.py", line 139, in get_results
    soup = self.get_page(search_url)
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 56, in get_page
    return self.get_soup_from_url(search_url, driver=self.get_driver())
  File "/Users/jack/Development/flathunter/flathunter/crawler/kleinanzeigen.py", line 44, in get_driver
    self.driver = get_chrome_driver(driver_arguments)
  File "/Users/jack/Development/flathunter/flathunter/chrome_wrapper.py", line 69, in get_chrome_driver
    driver = uc.Chrome(version_main=chrome_version, options=chrome_options) # pylint: disable=no-member
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 258, in __init__
    self.patcher.auto()
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 178, in auto
    self.unzip_package(self.fetch_package())
  File "/Users/jack/.local/share/virtualenvs/flathunter-Uqle8Kpe/lib/python3.10/site-packages/undetected_chromedriver/patcher.py", line 287, in fetch_package
    return urlretrieve(download_url)[0]
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/Users/jack/anaconda3/lib/python3.10/urllib/request.py

I found the related issues #538 and #439.
There, the problem seems to be related to the version of the Chrome driver.

Since flathunter works for wg-gesucht and immowelt in my case,
I assume that this issue is different and may be specific to kleinanzeigen.

chromedriver --version
ChromeDriver 121.0.6167.184 (057a8ae7deb3374d0f1b04b36304d236f0136188-refs/branch-heads/6167@{#1818})

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Google Chrome 121.0.6167.184
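
Since the related issues point to the driver version, the two version strings above can also be compared programmatically. This is a minimal sketch, assuming the `--version` output formats shown above; `major_version` and `versions_match` are hypothetical helper names and not part of flathunter or undetected_chromedriver.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from a `--version` output line."""
    # Look for the first four-part dotted version, e.g. 121.0.6167.184.
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True if browser and driver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


chrome = "Google Chrome 121.0.6167.184"
driver = (
    "ChromeDriver 121.0.6167.184 "
    "(057a8ae7deb3374d0f1b04b36304d236f0136188-refs/branch-heads/6167@{#1818})"
)
print(versions_match(chrome, driver))  # True
```

For the outputs above, both major versions are 121, so the installed browser and driver agree with each other.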

and my config file:

# Enable verbose mode (print DEBUG log messages)
# verbose: true

# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
  active: yes
  sleeping_time: 600

# Location of the Database to store already seen offerings
# Defaults to the current directory
#database_location: /path/to/database

# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, www.kleinanzeigen.de, meinestadt.de and vrm-immo.de.
# List the URLs in the following format:
# urls:
# 	- https://www.immobilienscout24.de/Suche/...
# 	- https://www.wg-gesucht.de/...
urls:
- https://www.kleinanzeigen.de/s-wohnung-mieten/schoeneberg/c203l3443
#- https://www.wg-gesucht.de/wohnungen-in-Muenchen.90.2.1.0.html
#- https://www.immowelt.de/suche/berlin/wohnungen/mieten?d=true&pma=1200&rmi=2&sd=DESC&sf=TIMESTAMP&sp=1

# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
#   'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More to Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000
filters:

# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
#
# blacklist:
#   - Innenstadt

# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#  
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
#   - "bicycling"
#   - "transit" (public transport)
#   - "driving"
#   - "walking"
# 
# The example configuration below includes a place for
# "John", located at the main train station of Munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#   - name: John
#     destination: Hauptbahnhof, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: bicycling
#         title: "Rad"
#   - name: Jane
#     destination: Karlsplatz, München
#     modes:
#       - gm_id: transit
#         title: "Öff."
#       - gm_id: driving
#         title: "Auto"

# Multiline message (yes, the | is supposed to be there), 
# to format the message received from the Telegram bot. 
# 
# Available placeholders:
#   - {title}: The title of the expose
#   - {rooms}: Number of rooms
#   - {price}: Price for the flat
#   - {durations}: Durations calculated by GMaps, see above
#   - {url}: URL to the expose
message: |
  {title}
  Zimmer: {rooms}
  Größe: {size}
  Preis: {price}
  Ort: {address}

  {url}

# Calculating durations requires access to the Google Maps API. 
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just be kept like that.
# To use the Google Maps API, an API key is required. You can obtain one
# without costs from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True
#
# google_maps_api:
#   key: YOUR_API_KEY
#   url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
#   enable: False

# If you are planning to scrape immoscout24.de, the bot will need
# to circumvent the site's captcha protection by using a captcha
# solving service. Register at either imagetyperz or 2captcha
# (the former is preferred), deposit some funds, uncomment the
# corresponding lines below and replace your API key/token.
# Use driver_arguments to provide options for Chrome WebDriver.
# captcha:
#       imagetyperz:
#             token: alskdjaskldjfklj
#       2captcha:
#             api_key: alskdjaskldjfklj
#       driver_arguments:
#         - "--headless"
captcha:

# You can select whether to be notified by telegram, apprise or by mattermost
# or Slack webhooks. For all notifiers selected here a configuration must be 
# provided below.
# notifiers:
#   - telegram
#   - apprise
#   - mattermost
#   - slack
notifiers:
- telegram

# Sending messages using Telegram requires a Telegram Bot configured. 
# Telegram.org offers good documentation on how to create a bot.
# Once you have read it, this will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot. 
#
# telegram:
#   bot_token: 160165XXXXXXX....
#   notify_with_images: true
#   receiver_ids:
#       - 12345....
#       - 67890....
telegram:
  bot_token: 6896489191:AAGvdqFTdJWUDHhT6qOzWSSZhrJ23WZkopg
  receiver_ids:
  - '16861054'

# Sending messages via mattermost requires a webhook url provided by a
# mattermost server. You can find a description how to set up a webhook with
# the official mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356
mattermost:

# Sending messages using Apprise requires an Apprise url.
# Apprise allows to send notifications to a wide variety of services.
# You can find a description how to set up an Apprise url with the official
# documentation: https://github.com/caronc/apprise
# Signal notifications are documented here https://github.com/caronc/apprise/wiki/Notify_signal
#
# apprise:
#   - gotifys://...
#   - mailto://..
#   - signal://localhost:9922/{FromPhoneNo}
apprise:

# Sending messages to a Slack channel requires a webhook url. You can find 
# a guide on how to set up a Slack webhook in the official documentation:
# https://api.slack.com/messaging/webhooks
#
# slack:
#   webhook_url: https://hooks.slack.com/services/T00000000/B00000000/XXXXXX...
slack:

# If you are running the web interface, you can configure Login with Telegram support
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#    bot_name: bot_name_xxx
#    domain: flathunter.example.com
#    session_key: SomeSecretValue
#    listen:
#      host: 127.0.0.1
#      port: 8080

# If you are deploying to google cloud,
# uncomment this and set it to your project id. More info in the readme.
# google_cloud_project_id: my-flathunters-project-id

# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True

# If you are having bot detection issues with immobilienscout24,
# you can set the cookie that you get from your logged in account
# Go to the immobilienscout24.de website, log in, and then in the developer tools
# (F12) go to the "Network" tab, then "Cookies" and copy the value of the
# "reese84" cookie.
immoscout_cookie: ''

I'd appreciate any help with this!
Please let me know if any further information is required.
