HTTP Error 404 from urllib during WebDriver initialization #439

Closed
djuelg opened this issue Aug 12, 2023 · 15 comments

Comments

@djuelg

djuelg commented Aug 12, 2023

Hello everyone,

Using the current main version, I seem to hit the same issue as #257, but for a different reason. The script crashes while trying to initialize the Chrome WebDriver for the crawler. I'm getting the following output:

flathunter-app-1  | [2023/08/12 10:07:32|config.py               |INFO    ]: Using config path /usr/src/app/config.yaml
flathunter-app-1  | [2023/08/12 10:07:32|logging.py              |DEBUG   ]: Settings from config: {"captcha_enabled": true, "captcha_driver_arguments": ["--no-sandbox", "--headless", "--disable-gpu", "--remote-debugging-port=9222", "--disable-dev-shm-usage", "window-size=1024,768"], "captcha_solver": "TwoCaptchaSolver", "imagetyperz_token": null, "twocaptcha_key": "35dxxxxxxxxxxxxxxxxxxxxxxxxxxd60", "mattermost_webhook_url": null, "notifiers": ["telegram"], "slack_webhook_url": "", "telegram_receiver_ids": ["-9******"], "telegram_bot_token": "595xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxfck", "target_urls": ["https://www.immobilienscout24.de/Suche/de/sachsen/leipzig/wohnung-mieten?numberofrooms=2.0-&price=-900.0&exclusioncriteria=swapflat&equipment=builtinkitchen,balcony&pricetype=calculatedtotalrent&geocodes=147130000040,147130000030,147130000031,147130000020,147130000022,147130000000,147130000001,147130000002,147130000003,147130000004,147130000005,147130000006&sorting=2", "https://www.immowelt.de/liste/leipzig-neustadt-neuschoenefeld/wohnungen/mieten?d=true&ffs=BALCONY_OR_TERRACE&ffs=FITTED_KITCHEN&lids=508689&lids=508686&lids=508683&lids=508682&lids=508681&lids=508678&lids=508671&lids=508666&lids=508657&lids=508589&pma=750&rmi=2&sd=DESC&sf=TIMESTAMP&sp=1", "https://www.ebay-kleinanzeigen.de/s-wohnung-mieten/leipzig/anzeige:angebote/preis::750/wohnung/k0c203l4233+wohnung_mieten.swap_s:nein+wohnung_mieten.zimmer_d:2%2C4+options:wohnung_mieten.balcony_b,wohnung_mieten.built_in_kitchen_b", "https://www.wg-gesucht.de/wohnungen-in-Leipzig.77.2.1.0.html?offer_filter=1&city_id=77&sort_order=0&noDeact=1&categories%5B%5D=2&rent_types%5B%5D=2&rent_types%5B%5D=2&rMax=750&ot%5B%5D=1767&ot%5B%5D=1794&ot%5B%5D=1799&ot%5B%5D=1806&ot%5B%5D=1807&ot%5B%5D=1811&ot%5B%5D=1812&ot%5B%5D=1813&ot%5B%5D=1814&ot%5B%5D=1815&ot%5B%5D=1816&ot%5B%5D=1817&rmMin=2&rmMax=4&exc=2&kit=1&bal_or_ter=1"], "use_proxy": false}
flathunter-app-1  | [2023/08/12 10:07:32|immobilienscout.py      |DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/de/sachsen/leipzig/wohnung-mieten?numberofrooms=2.0-&price=-900.0&exclusioncriteria=swapflat&equipment=builtinkitchen,balcony&pricetype=calculatedtotalrent&geocodes=147130000040,147130000030,147130000031,147130000020,147130000022,147130000000,147130000001,147130000002,147130000003,147130000004,147130000005,147130000006&sorting=2&pagenumber={0}
flathunter-app-1  | [2023/08/12 10:07:32|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
flathunter-app-1  | Traceback (most recent call last):
flathunter-app-1  |   File "/usr/src/app/flathunt.py", line 99, in <module>
flathunter-app-1  |     main()
flathunter-app-1  |   File "/usr/src/app/flathunt.py", line 95, in main
flathunter-app-1  |     launch_flat_hunt(config, heartbeat)
flathunter-app-1  |   File "/usr/src/app/flathunt.py", line 35, in launch_flat_hunt
flathunter-app-1  |     hunter.hunt_flats()
flathunter-app-1  |   File "/usr/src/app/flathunter/hunter.py", line 56, in hunt_flats
flathunter-app-1  |     for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
flathunter-app-1  |                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/hunter.py", line 35, in crawl_for_exposes
flathunter-app-1  |     return chain(*[try_crawl(searcher, url, max_pages)
flathunter-app-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/hunter.py", line 35, in <listcomp>
flathunter-app-1  |     return chain(*[try_crawl(searcher, url, max_pages)
flathunter-app-1  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/hunter.py", line 27, in try_crawl
flathunter-app-1  |     return searcher.crawl(url, max_pages)
flathunter-app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 151, in crawl
flathunter-app-1  |     return self.get_results(url, max_pages)
flathunter-app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/crawler/immobilienscout.py", line 90, in get_results
flathunter-app-1  |     soup = self.get_page(search_url, self.get_driver(), page_no)
flathunter-app-1  |                                      ^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/crawler/immobilienscout.py", line 65, in get_driver
flathunter-app-1  |     self.driver = get_chrome_driver(driver_arguments)
flathunter-app-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/src/app/flathunter/chrome_wrapper.py", line 50, in get_chrome_driver
flathunter-app-1  |     driver = uc.Chrome(version_main=chrome_version, options=chrome_options) # pylint: disable=no-member
flathunter-app-1  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 247, in __init__
flathunter-app-1  |     self.patcher.auto()
flathunter-app-1  |   File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/patcher.py", line 158, in auto
flathunter-app-1  |     release = self.fetch_release_number()
flathunter-app-1  |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/patcher.py", line 222, in fetch_release_number
flathunter-app-1  |     return LooseVersion(urlopen(self.url_repo + path).read().decode())
flathunter-app-1  |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 216, in urlopen
flathunter-app-1  |     return opener.open(url, data, timeout)
flathunter-app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 525, in open
flathunter-app-1  |     response = meth(req, response)
flathunter-app-1  |                ^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 634, in http_response
flathunter-app-1  |     response = self.parent.error(
flathunter-app-1  |                ^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 563, in error
flathunter-app-1  |     return self._call_chain(*args)
flathunter-app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 496, in _call_chain
flathunter-app-1  |     result = func(*args)
flathunter-app-1  |              ^^^^^^^^^^^
flathunter-app-1  |   File "/usr/local/lib/python3.11/urllib/request.py", line 643, in http_error_default
flathunter-app-1  |     raise HTTPError(req.full_url, code, msg, hdrs, fp)
flathunter-app-1  | urllib.error.HTTPError: HTTP Error 404: Not Found
flathunter-app-1  | [2023/08/12 10:07:33|__init__.py             |INFO    ]: ensuring close
flathunter-app-1 exited with code 0

The script crashes only if the immobilienscout24 URL is part of the search. The other three URLs work just fine.
I added the following information to the logger output, which shows that version parsing is not the problem (as it was in #257):

flathunter-app-1  | [2023/08/12 10:07:32|chrome_wrapper.py       |INFO    ]: CHROME VERSION: Google Chrome 115.0.5790.170
flathunter-app-1  | [2023/08/12 10:07:32|chrome_wrapper.py       |INFO    ]: SPLITTED CHROME VERSION: 115
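
For readers who want to reproduce this check, here is a minimal sketch of extracting the major version from a string like the one logged above. The function name `parse_chrome_major_version` and the regular expression are assumptions for illustration; this is not flathunter's actual parsing code.

```python
import re
from typing import Optional


def parse_chrome_major_version(version_output: str) -> Optional[int]:
    """Return the major version from text such as 'Google Chrome 115.0.5790.170'."""
    # Look for a four-part dotted version and capture its first component.
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


print(parse_chrome_major_version("Google Chrome 115.0.5790.170"))
```

If no dotted version is found, the function returns `None` instead of raising, so a caller can decide how to report the problem.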

I tested other immobilienscout24 URLs, which didn't work either.
The script is running via docker-compose on an Ubuntu 20.04.6 host.

@vincentvonu

It is an issue with undetected_chromedriver. The following pull request fixed the issue for me: ultrafunkamsterdam/undetected-chromedriver#1427
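
The traceback ends in `fetch_release_number`, which calls `urlopen(self.url_repo + path)` and lets the raw 404 propagate. As a rough sketch of how such a lookup could fail with a clearer message, the snippet below wraps the request and translates a 404 into a readable error. The function name `fetch_latest_release`, the `/LATEST_RELEASE_<major>` path layout, and the injectable `opener` parameter are assumptions for illustration; this is not the library's code and not the change made in the linked pull request.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_latest_release(url_repo: str, major_version: int, opener=urlopen) -> str:
    """Look up the newest driver release for a Chrome major version.

    The '/LATEST_RELEASE_<major>' path layout is an assumption for illustration.
    """
    url = f"{url_repo}/LATEST_RELEASE_{major_version}"
    try:
        with opener(url) as response:
            return response.read().decode().strip()
    except HTTPError as err:
        if err.code == 404:
            # Turn the bare "HTTP Error 404" into something actionable.
            raise RuntimeError(
                f"No driver release published at {url}; "
                f"Chrome {major_version} may be newer than this repository supports."
            ) from err
        raise
```

Passing the opener as a parameter keeps the function testable without network access; other HTTP errors are re-raised unchanged.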

@silasburger

@djuelg Any luck?

@CMMenzel

Hi Folks, I get exactly the same error. However, also new to linux. @vincentvonu what exactly should I do to get the pull to my machine (ubuntu 22.04)? Would be awesome if you can explain it to a dummy :-) Thanks!

@djuelg
Author

djuelg commented Aug 19, 2023

> @djuelg Any luck?

So far, I haven't had time to try vincentvonu's approach. The easiest solution for now would probably be to put the .deb file of an older Chrome version into your local flathunter checkout and then edit the Dockerfile to install that local .deb file instead of the package from the apt repository.

@vincentvonu

vincentvonu commented Aug 19, 2023

> Hi Folks, I get exactly the same error. However, also new to linux. @vincentvonu what exactly should I do to get the pull to my machine (ubuntu 22.04)? Would be awesome if you can explain it to a dummy :-) Thanks!

You have to replace the file patcher.py in the undetected_chromedriver lib with the one from the pull I've linked.

If you have created a virtualenv for your flathunter instance, you should find this file under: /home/[user]/.local/share/virtualenvs/[name of your flathunter venv]/lib/python/site-packages/undetected_chromedriver/ (this is the Debian path; on Ubuntu it might be slightly different)

Hope that helps!
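
If the path differs on your system, a small sketch like the following can locate the file from within the same Python environment that runs flathunter. The helper name `find_package_file` is hypothetical; `importlib.util.find_spec` is part of the standard library and resolves a top-level package without importing it.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def find_package_file(package: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` inside an installed package, if present."""
    spec = find_spec(package)
    # Plain modules and missing packages have no search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Prints the location of patcher.py, or None if the package is not installed.
print(find_package_file("undetected_chromedriver", "patcher.py"))
```

Run it with the interpreter of the virtualenv (or container) in question, since each environment has its own site-packages directory.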

@Krystex

Krystex commented Aug 21, 2023

Hi all. This is a patch for the Dockerfile, which basically pulls an old version of chrome. A bit hacky but it works.

diff --git a/Dockerfile b/Dockerfile
index 793656d..60b3444 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -9,6 +9,11 @@ RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key
 RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
 RUN apt-get -y update
 RUN apt-get install -y google-chrome-stable
+# Check available versions here: https://www.ubuntuupdates.org/package/google_chrome/stable/main/base/google-chrome-stable
+ARG CHROME_VERSION="114.0.5735.90-1"
+RUN wget --no-verbose -O /tmp/chrome.deb https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_${CHROME_VERSION}_amd64.deb \
+  && dpkg -i /tmp/chrome.deb \
+  && rm /tmp/chrome.deb
 
 # Upgrade pip, install pipenv
 RUN pip install --upgrade pip

In my opinion the installation of Chrome should be version-pinned so stuff like this doesn't happen.
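
To check that a pinned version actually exists before rebuilding the image, the download URL used in the patch can be reconstructed outside Docker. This is only a sketch: the helper name `chrome_deb_url` is hypothetical, and the URL layout is copied from the `wget` line in the diff above.

```python
def chrome_deb_url(version: str, arch: str = "amd64") -> str:
    """Build the .deb download URL for a pinned google-chrome-stable version."""
    # Base path taken from the wget command in the Dockerfile patch.
    base = "https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable"
    return f"{base}/google-chrome-stable_{version}_{arch}.deb"


print(chrome_deb_url("114.0.5735.90-1"))
```

Opening the printed URL in a browser or with `wget` shows whether that version is still being served.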

@codders

codders commented Aug 22, 2023

I've merged the PR from dependabot that updates undetected_chromedriver to the latest version. I don't recommend pinning the version of Chrome in the Dockerfile. Scraping and scraper detection are a moving target, and although not pinning makes flathunter less stable overall, the websites we are scraping are constantly changing too. Your best chance of getting support for your issues is to use the same software as everyone else (in this case, the latest version of Chrome).

@CMMenzel

> > Hi Folks, I get exactly the same error. However, also new to linux. @vincentvonu what exactly should I do to get the pull to my machine (ubuntu 22.04)? Would be awesome if you can explain it to a dummy :-) Thanks!
>
> You have to replace the file patcher.py in the undetected_chromedriver lib with the one from the pull I've linked.
>
> If you have created a virtualenv for your flathunter instance, you should find this file under: /home/[user]/.local/share/virtualenvs/[name of your flathunter venv]/lib/python/site-packages/undetected_chromedriver/ (this is the debian path, ubuntu might be slightly different)
>
> Hope that helps!

Thanks a lot, @vincentvonu. I tried this approach but somehow got an error. I fixed it by simply installing Chrome 114.

@codders

codders commented Aug 22, 2023 via email

@codders

codders commented Aug 23, 2023

Okay. I take it back. This seems to be 100% a problem with undetected_chromedriver not supporting the latest versions (115+) of Google Chrome, and pinning does seem to be the best option for right now. Which sucks =/ Thanks for the patch @Krystex

@Krystex

Krystex commented Aug 23, 2023

You're welcome! That's why I think version pinning is the right move: things are easier to troubleshoot if everyone has the same version, and you can just bump the pin to the latest working Chrome version. But I understand that not everyone uses Docker, so it's a bit hard to enforce.

@paulwelzel

Hey, any idea how to apply the fix that seems to work in local Docker images to an App Engine deployment on gcloud? I'm struggling with this.

@codders

codders commented Aug 26, 2023

Hi @paulwelzel,

The Google App Engine deployment doesn't support crawling Immobilienscout because we can't install Chrome in the App Engine environment. For Google Cloud Run, you can apply the same Dockerfile changes to the Dockerfile.gcloud.job file, and that should pin the Chrome version for you.

@codders

codders commented Aug 29, 2023

Just merged #454, which bumps undetected-chromedriver to a version that should support the latest Chrome. Please retest and check whether this resolves your issues. Thanks!

@codders

codders commented Aug 31, 2023

Also merged #460, which fixes a bug in flathunter's use of the new undetected-chromedriver. Closing this for now. Please re-open if you're still having trouble.

7 participants