
www.immobilienscout24.de crawl_immobilienscout.py|ERROR ]: Index error occurred #214

Closed
heapxor opened this issue Sep 5, 2022 · 25 comments


@heapxor

heapxor commented Sep 5, 2022

Hello,
I'm using the following URL in the config:
urls:
- https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=2.5-&price=-1000.0&livingspace=65.0-&pricetype=rentpermonth&sorting=2

After running it I get the following error. Is that because 2captcha is missing from the config file?

flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/05 15:41:58|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/05 15:41:58|crawl_immobilienscout.py|ERROR ]: Index error occurred

^CTraceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 106, in main
    launch_flat_hunt(config, heartbeat)
  File "/home/flathunter/flathunter/flathunt.py", line 36, in launch_flat_hunt
    time.sleep(config.loop_period_seconds())
KeyboardInterrupt

thanks!

@codders

codders commented Sep 6, 2022

Yes - that's very likely. Crawling immoscout without 2captcha / imagetyperz support is expected to fail. Does it work if you configure the captcha solving?

@heapxor
Author

heapxor commented Sep 9, 2022

@codders is there any difference between 2captcha and imagetyperz? Thanks.

@codders

codders commented Sep 9, 2022

There isn't much difference. For a while we had problems with 2captcha, so we integrated Imagetyperz as a backup, but 2captcha has been working fine again for a while now.

I would suggest 2captcha for now - that's what I'm using, so at least if it breaks there is someone trying to fix it for you :)

@heapxor
Author

heapxor commented Sep 9, 2022

@codders sure, thanks.

I'm still getting a weird error :(

flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/09 13:05:04|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 13:05:04|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 13:05:05|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/d9b6d2334d8b50fd_chromedriver
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
    p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

@codders

codders commented Sep 9, 2022

Can you set verbose_logging to true in your config and try again? Also might be good to clear out the webdriver-manager cache (/home/flathunter/.wdm).
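Clearing the cache by hand is just a matter of deleting the directory. As a minimal sketch, the helper below removes a cache directory only if it exists and reports whether anything was deleted. The function name `clear_cache_dir` is made up for illustration and is not part of flathunter.

```python
import shutil
from pathlib import Path


def clear_cache_dir(path: Path) -> bool:
    """Remove a cache directory if it exists; return True if something was deleted."""
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

Called as `clear_cache_dir(Path.home() / ".wdm")`, it returns `False` when the directory is not there, which simply means there is nothing to clear.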

@heapxor
Author

heapxor commented Sep 9, 2022

@codders I turned verbose logging on, but I can't find the folder for the webdriver-manager cache. Do I have to install webdriver-manager first?

flathunter@docker-base:~/flathunter$ ls /home/flathunter/.wdm
ls: cannot access '/home/flathunter/.wdm': No such file or directory

The error is here:
flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/09 16:16:07|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 16:16:07|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f820dc66a70>
[2022/09/09 16:16:07|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 16:16:08|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/c1484b1e513af397_chromedriver
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
    p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

@codders

codders commented Sep 9, 2022

Seems like you're not the only person with this issue:
ultrafunkamsterdam/undetected-chromedriver#285
ultrafunkamsterdam/undetected-chromedriver#787
ultrafunkamsterdam/undetected-chromedriver#193

What can you tell me about your execution environment? Is Google Chrome / Chromium definitely installed? Are you running inside any kind of container or virtualisation?
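The `TypeError ... not NoneType` raised from `Popen([executable, *args], ...)` in the traceback above is consistent with no browser binary having been found, although the thread does not confirm that as the cause. A quick way to check is to look for the usual binary names on `PATH`. This is a minimal sketch: the list of candidate names is an assumption and varies by distribution, and `find_browser` is a made-up helper, not something flathunter or undetected_chromedriver provides.

```python
import shutil
from typing import Iterable, Optional

# Common binary names (assumption; the actual name depends on the distribution).
CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_browser(candidates: Iterable[str] = CANDIDATES) -> Optional[str]:
    """Return the full path of the first browser binary found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_browser()
    print(found or "No Chrome/Chromium binary found on PATH")
```

If this prints no path for the user that runs flathunter, installing Chrome or Chromium is the first thing to try.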

@heapxor
Author

heapxor commented Sep 9, 2022

@codders I'm running it as a regular Linux user on an Ubuntu server.
Is the Google Chrome / Chromium package required? I can't see it in the prerequisites.

thanks

@heapxor
Author

heapxor commented Sep 9, 2022

@codders okay, I installed ChromeDriver and the Chromium browser and ran the code as follows:

flathunter@heap-virtual-machine:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/10 00:22:41|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/10 00:22:41|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f2d146ef310>
[2022/09/10 00:22:41|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/10 00:22:41|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8f45eca73f1bdb25_chromedriver

and that's the error:

Traceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 76, in main
    config.init_searchers()
  File "/home/flathunter/flathunter/flathunter/config.py", line 96, in init_searchers
    CrawlImmobilienscout(self),
  File "/home/flathunter/flathunter/flathunter/crawl_immobilienscout.py", line 39, in __init__
    self.driver = self.configure_driver(driver_arguments)
  File "/home/flathunter/flathunter/flathunter/abstract_crawler.py", line 59, in configure_driver
    driver = uc.Chrome(options=chrome_options)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
    super().__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 270, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 363, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
    self.error_handler.check_response(response)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:44791
from chrome not reachable
Stacktrace:
#0 0x556e5caed693
#1 0x556e5c8e69db
#2 0x556e5c8d681e
#3 0x556e5c90f677
#4 0x556e5c906e9f
#5 0x556e5c942953
#6 0x556e5c93c743
#7 0x556e5c912533
#8 0x556e5c913715
#9 0x556e5cb3d7bd
#10 0x556e5cb40bf9
#11 0x556e5cb22f2e
#12 0x556e5cb419b3
#13 0x556e5cb16e4f
#14 0x556e5cb60ea8
#15 0x556e5cb61052
#16 0x556e5cb7b71f
#17 0x7fe397279b43


@heapxor
Author

heapxor commented Sep 9, 2022

Maybe the issue is that Ubuntu doesn't have Chrome, only Chromium?

flathunter@heap-virtual-machine:~/flathunter$ chrome --version
Command 'chrome' not found, did you mean:
  command 'chroma' from deb chroma (1.19-1ubuntu1)
  command 'chroma' from deb golang-chroma (0.9.4-1)
Try: apt install <deb name>
flathunter@heap-virtual-machine:~/flathunter$ chromium --version
Chromium 105.0.5195.52 snap
flathunter@heap-virtual-machine:~/flathunter$

edit2
I tried to install Chrome via the how-to posted here: https://linuxize.com/post/how-to-install-google-chrome-web-browser-on-ubuntu-20-04/

edit3
It still crashes with the same error.

Similar issue?
https://stackoverflow.com/questions/73115181/message-unknown-error-cannot-connect-to-chrome-at-127-0-0-150276
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:44729
from chrome not reachable

@heapxor
Author

heapxor commented Sep 9, 2022

I run GNOME; when I executed it from within the GNOME session, the browser opened.

[2022/09/10 00:42:04|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8c382c85cf8c97b3_chromedriver
[2022/09/10 00:42:25|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.8s (selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
#0 0x564be599f693 <unknown>
#1 0x564be5798b0a <unknown>
#2 0x564be57d15f7 <unknown>
#3 0x564be57d17c1 <unknown>
#4 0x564be5804804 <unknown>
#5 0x564be57ee94d <unknown>
#6 0x564be58024b0 <unknown>
#7 0x564be57ee743 <unknown>
#8 0x564be57c4533 <unknown>
#9 0x564be57c5715 <unknown>
#10 0x564be59ef7bd <unknown>
#11 0x564be59f2bf9 <unknown>
#12 0x564be59d4f2e <unknown>
#13 0x564be59f39b3 <unknown>
#14 0x564be59c8e4f <unknown>
#15 0x564be5a12ea8 <unknown>
#16 0x564be5a13052 <unknown>
#17 0x564be5a2d71f <unknown>
#18 0x7f1d4a8a7b43 <unknown>)

The Telegram part of the configuration might be confusing. The receiver ID is a negative number, so I assume it should be set in the configuration as follows:

receiver_ids:
- -343343434

Correct?

Also, when I run the script now I get logs like the ones below. Is that the expected behavior?

[2022/09/10 00:57:18|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.4s (selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
#0 0x5571f5b8b693
#1 0x5571f5984b0a
#2 0x5571f59bd5f7
#3 0x5571f59bd7c1
#4 0x5571f59f0804
#5 0x5571f59da94d
#6 0x5571f59ee4b0
#7 0x5571f59da743
#8 0x5571f59b0533
#9 0x5571f59b1715
#10 0x5571f5bdb7bd
#11 0x5571f5bdebf9
#12 0x5571f5bc0f2e
#13 0x5571f5bdf9b3
#14 0x5571f5bb4e4f
#15 0x5571f5bfeea8
#16 0x5571f5bff052
#17 0x5571f5c1971f
#18 0x7f5fb82b8b43 )
[2022/09/10 00:57:23|idmaintainer.py |DEBUG ]: is_processed(136352826)
[2022/09/10 00:57:23|idmaintainer.py |DEBUG ]: is_processed(136352314)

@codders

codders commented Sep 10, 2022

Hey @heapxor ,

Sorry you're having some troubles here. It would certainly make sense for us to update the documentation based on the spots that didn't make sense for you.

Incidentally, if you are looking for a flat in Berlin, you might also have success just using the hosted version at https://flathunter.codders.io - you can just log in there with Telegram and set a (basic) filter.

But otherwise, I hope to have some time in the next few days to look at your issues, or else maybe someone else can support you.

@codders

codders commented Sep 10, 2022

@heapxor ,

I don't think your Telegram ID should be a negative number. How did you get that?

To make chrome work, maybe try these driver-arguments:

                - --no-sandbox
                - --headless
                - --disable-gpu
                - --remote-debugging-port=9222
                - --disable-dev-shm-usage
                - window-size=1024,768
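For context on what these arguments do: the tracebacks in this thread show `configure_driver(driver_arguments)` building options that are then handed to `uc.Chrome(options=chrome_options)`. A plausible reading, not checked against the flathunter source, is that each configured string is passed to the browser options unchanged via `add_argument`, the method Selenium's options classes expose. The sketch below illustrates that with a stand-in options class so it runs without Selenium installed. `FakeOptions` and `apply_driver_arguments` are made-up names.

```python
from typing import Iterable, List


class FakeOptions:
    """Stand-in for a Selenium-style options object (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.arguments: List[str] = []

    def add_argument(self, argument: str) -> None:
        self.arguments.append(argument)


def apply_driver_arguments(options, driver_arguments: Iterable[str]):
    """Forward each configured argument to the options object unchanged."""
    for argument in driver_arguments:
        options.add_argument(argument)
    return options


if __name__ == "__main__":
    # The list from the comment above, copied verbatim.
    configured = [
        "--no-sandbox",
        "--headless",
        "--disable-gpu",
        "--remote-debugging-port=9222",
        "--disable-dev-shm-usage",
        "window-size=1024,768",
    ]
    print(apply_driver_arguments(FakeOptions(), configured).arguments)
```

Because nothing is validated along the way, a misspelled flag would be passed through silently, so it is worth double-checking the spelling in the config.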

@heapxor
Author

heapxor commented Sep 10, 2022

@codders,
why do you think it shouldn't be a negative number?
When I call this, my bot receives the message:
curl "https://api.telegram.org/botTOKEN/sendMEssage?chat_id=-628934068&text=Hello+World"
and you can see the ID is a negative number.
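Negative chat IDs are normal for Telegram group chats, so the curl call above is a fair test. The same request URL can be built in Python, which shows that the minus sign needs no special encoding. This is a minimal sketch: `send_message_url` is a made-up helper and `TOKEN` is a placeholder, as in the curl call.

```python
from urllib.parse import urlencode


def send_message_url(token: str, chat_id: int, text: str) -> str:
    """Build a Bot API sendMessage URL; chat_id may be negative (group chats)."""
    query = urlencode({"chat_id": chat_id, "text": text})
    return f"https://api.telegram.org/bot{token}/sendMessage?{query}"


if __name__ == "__main__":
    # "TOKEN" is a placeholder, not a real bot token.
    print(send_message_url("TOKEN", -628934068, "Hello World"))
```

In the config, the ID then goes into `receiver_ids` as a plain negative integer.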

Where can I put these driver arguments? Thanks!

@heapxor
Author

heapxor commented Sep 10, 2022

Also wondering: is this the expected behavior?

[image]

And here I can see the crash after 7 hours of running the code:

[image]
[image]

@codders

codders commented Sep 10, 2022

Ah. okay. then that must be your ID :)

Arguments go in the config file like this:
[screenshot: 2022-09-10-120234_556x303_scrot]

If you are seeing crashes after some hours, try the arguments here. If the problem persists, maybe check that you have enough memory free (around 1GB for the browser and python etc.). But happy that you are receiving messages now :)
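To check the memory point before a long run, the `MemAvailable` field of `/proc/meminfo` on Linux is a reasonable proxy for "enough memory free". The sketch below parses that field and compares it with the roughly 1GB mentioned above. `mem_available_mib` is a made-up helper, and the 1024 MiB threshold is just that figure restated, not a measured requirement.

```python
import re
from typing import Optional


def mem_available_mib(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable from /proc/meminfo-style text in MiB, or None if absent."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) // 1024 if match else None


if __name__ == "__main__":
    try:
        with open("/proc/meminfo", encoding="ascii") as handle:
            available = mem_available_mib(handle.read())
    except OSError:
        available = None  # not Linux, or /proc is not readable

    if available is None:
        print("Could not determine available memory")
    elif available < 1024:
        print(f"Only {available} MiB available; this may be too little for the browser")
    else:
        print(f"{available} MiB available")
```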

@heapxor
Author

heapxor commented Sep 10, 2022

@codders cool, I'll try the arguments! Thanks.

Just wondering: is this something that has to be analyzed further, or is it okay?

[image]

Yes, I'll try to add more RAM to that machine.

@codders

codders commented Sep 10, 2022

The CAPCHA_NOT_READY message is very normal. That happens every time a captcha is solved.

CaptchaUnsolvableError also happens from time to time. Sometimes, 2captcha just can't solve the captcha. The IndexError: list index out of range happens with ImmoScout when the captcha solving fails.

When you get these errors, best is just to restart. I want to change the code soon so that it retries if it gets a CaptchaUnsolvableError. But if you see that message every time, it is probably a problem with 2captcha (or something with the ImmoScout website has changed).

For now, you can either run the code as a cron job - set it to run every 10 minutes then quit (by disabling the 'loop' option), or you can run it as a systemd service (there is some documentation around that). Systemd will restart it when it exits.

TimeoutException is also possible. It's not a bad thing if that happens - the system will retry.
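Until a retry lands in the code, the idea can be sketched as a small wrapper that calls a function again when it raises a given exception. This is only an illustration of the approach, not how flathunter implements or plans to implement it. The `CaptchaUnsolvableError` class below is a local stand-in with the same name as the error mentioned above, and `retry` is a made-up helper.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class CaptchaUnsolvableError(Exception):
    """Local stand-in for the error named in the thread (hypothetical definition)."""


def retry(func: Callable[[], T],
          exceptions: Tuple[Type[BaseException], ...],
          attempts: int = 3,
          delay_seconds: float = 1.0) -> T:
    """Call func until it succeeds or the attempts are used up, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller (or systemd/cron) handle it
            time.sleep(delay_seconds * attempt)  # simple linear back-off
    raise ValueError("attempts must be at least 1")
```

A call such as `retry(crawl_once, (CaptchaUnsolvableError,), attempts=3)` would then retry a hypothetical `crawl_once` function twice before giving up. If the error shows up on every attempt, retrying will not help, as noted above.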

@heapxor
Author

heapxor commented Sep 10, 2022

@codders, sounds cool.
I was thinking of running it every 4 minutes. Is that also okay?

Okay, disabling the loop and running it via cron makes sense; that way I can work around the captcha issue and it should be safe.

Where is the TimeoutException retry configured? Or is that still planned? Thanks!

edit2
@codders I'm also assuming there is no functionality to automate a scenario like "send a message to the new ad"?

@codders

codders commented Sep 10, 2022

Running more quickly is also okay for Immoscout. With ebay Kleinanzeigen you can get an IP block if you crawl too quickly. Just be aware there is no locking / concurrency control, so if the previous run didn't finish after 4 mins, you will have two flathunters at once, which will have weird effects.
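Since there is no locking built in, one way to keep cron runs from overlapping is to take an exclusive, non-blocking lock on a file at startup and skip the run if the lock is already held. The sketch below uses `fcntl.flock`, so it works on Linux and other Unix systems only. `acquire_run_lock` is a made-up helper and the lock file path is whatever you choose; none of this is part of flathunter.

```python
import fcntl
from typing import Optional, TextIO


def acquire_run_lock(path: str) -> Optional[TextIO]:
    """Try to take an exclusive, non-blocking lock; return the open file or None."""
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another process (a previous run) still holds the lock.
        handle.close()
        return None
    return handle  # keep this object alive for the whole run


# Usage sketch (hypothetical lock path):
#   lock = acquire_run_lock("/tmp/flathunter.lock")
#   if lock is None:
#       raise SystemExit("Previous run still active, skipping this one")
#   ... run the crawl; the lock is released when the process exits
```

The same effect can be had without touching Python by wrapping the cron command in the `flock` utility.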

For the timeouts and other errors, there is no plan right now. People who want it to be different make pull requests :)

@heapxor
Author

heapxor commented Sep 11, 2022

@codders,
it stopped working :(

[2022/09/11 21:39:20|patcher.py              |INFO    ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/b023f19cc09c2dbf_chromedriver
Traceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 76, in main
    config.init_searchers()
  File "/home/flathunter/flathunter/flathunter/config.py", line 96, in init_searchers
    CrawlImmobilienscout(self),
  File "/home/flathunter/flathunter/flathunter/crawl_immobilienscout.py", line 39, in __init__
    self.driver = self.configure_driver(driver_arguments)
  File "/home/flathunter/flathunter/flathunter/abstract_crawler.py", line 59, in configure_driver
    driver = uc.Chrome(options=chrome_options)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
    super().__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 270, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 363, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
    self.error_handler.check_response(response)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:49795
from chrome not reachable
Stacktrace:

any idea why?

@heapxor
Author

heapxor commented Sep 11, 2022

Hm, so I commented out these two lines:
# - "--remte-debugging-port=9222"
# - "--disable-dev-shm-usage"

and now it works.

@codders

codders commented Sep 27, 2022

@heapxor Can we mark this issue as closed? Do you want to add some information to the README about your tips for making it work successfully?

@mourraille

These driver arguments did the trick for me. Currently running on Ubuntu. I also had to install chrome as suggested by @heapxor

@codders

codders commented Oct 18, 2022

okay. I'll mark this as closed. If you want to make a PR to update the documentation about the chrome requirement for 2captcha support, that would be very welcome :)
