immoscout24 broken somehow #45

Closed
choeffer opened this issue Aug 4, 2020 · 20 comments · Fixed by #49

Comments

@choeffer

choeffer commented Aug 4, 2020

I am using the URL https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/koeln/wohnung-mieten?sorting=2 and for the past few hours I have only been getting a long printout, with no results sent via the Telegram bot anymore. Ebay Kleinanzeigen is still working fine.

file.log

If I can provide more info, please let me know.

@codders

codders commented Aug 4, 2020

Hey there,

From your logs, it looks like Immoscout has detected that your flathunter is a bot and has blocked it. Can you browse the site normally in a web browser?

I've not encountered bot-detection on Immoscout before. It would be interesting to know if other users report the same thing.

@choeffer
Author

choeffer commented Aug 4, 2020

Hey,

Thanks for your fast reply. I can browse the website normally via Firefox, and I had the same problem at work as at home, so I assume it is not IP related. Does flathunter use cookies or something similar?

@choeffer
Author

choeffer commented Aug 4, 2020

It seems that Immoscout is complaining that cookies are not used and JavaScript is disabled.

@codders

codders commented Aug 4, 2020

Okay. I see the same thing on my machine. So it looks like they've upgraded their bot detection. I just tried here with a fake user agent (so it looks like Firefox instead of a Python script), but that doesn't help. I also tried here adding cookies support, and that doesn't fix it. I'll need to take a deeper look at what they're detecting, and I don't have any time to do that in the coming weeks I'm afraid.
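
For readers following along, here is a minimal sketch of what "a fake user agent plus cookie support" amounts to. It uses only the Python standard library so that it runs without extra packages (flathunter itself uses `requests`). The function name `build_browser_like_opener` and the example User-Agent string are assumptions made for illustration, not part of flathunter. As reported above, this kind of change did not get past the bot detection.

```python
import http.cookiejar
import urllib.request

# Example desktop Firefox User-Agent string (illustrative value only).
FIREFOX_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
)


def build_browser_like_opener(user_agent=FIREFOX_USER_AGENT):
    """Return (opener, cookie_jar): an opener that keeps cookies and sends user_agent."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar
```

A page would then be fetched with `opener.open(url).read()`, and any cookies set by the server end up in `cookie_jar` and are sent back on later requests made through the same opener.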

But thanks for the report - this is something that will be affecting all users.

@choeffer
Author

choeffer commented Aug 5, 2020

I also got the same error page as the crawler in Firefox after doing some manual refreshes this morning, so it is not purely flathunter related. After solving a Google Captcha I was able to continue using Immoscout with Firefox. The Firefox I used has no plugins or ad blockers.

@codders

codders commented Aug 5, 2020 via email

@Bolland

Bolland commented Aug 5, 2020

The same happens to my setup running on a Raspberry Pi. It also crashes my script every time 🤔

...
</div>\n<div class="main__part1">\n    Du bist ein Mensch aus Fleisch und Blut? Entschuldige bitte, dann hat unser System dich f\xc3\xa4lschlicherweise als Roboter identifiziert. Um unsere Services weiterhin zu nutzen, l\xc3\xb6se bitte diesen kurzen Test.\n</div>\n\n    <div class="main__captcha">\n        \n        <div class="container">\n            \n                    <script>\n                    showBlockPage()\n                    document.writeln(window.captchaDescription || "<p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p>");\n                    </script>\n                <div class="g-recaptcha" data-sitekey="6LeaILIZAAAAALTgLZV1AQXPc2dAsLItNYJ8jVvB" data-callback="solvedCaptcha"></div>\n        </div>\n    </div>\n\n<div class="main__part2">\n\n    <div class="main_part2_header1">Warum f\xc3\xbchren wir diese Sicherheitsma\xc3\x9fnahme durch?</div>\n<div class="main_part2_text1">Mit der Captcha-Methode stellen wir fest, dass du kein Roboter oder eine sch\xc3\xa4dliche Spam-Software bist.  Damit sch\xc3\xbctzen wir unsere Webseite und die Daten unserer Nutzerinnen und Nutzer vor betr\xc3\xbcgerischen Aktivit\xc3\xa4ten.</div>\n\n    <div class="main_part2_header2">Warum haben wir deine Suchanfragen blockiert?</div>\n    <div class="main_part2_text2">Es kann verschiedene Gr\xc3\xbcnde haben, warum wir dich f\xc3\xa4lschlicherweise als Roboter identifiziert haben. 
M\xc3\xb6glicherweise</div>\n\n</div>\n<div class="main__list">\n<ul>\n    <li>hast du die Cookies f\xc3\xbcr unsere Seite deaktiviert.</li>\n    <li>hast du die Ausf\xc3\xbchrung von JavaScript deaktiviert.</li>\n    <li>nutzt du ein Browser-Plugin eines Drittanbieters, beispielsweise einen Ad-Blocker.</li>\n<li>hast du in kurzer Zeit mehr Anfragen an unser System gestellt, als es \xc3\xbcblicherweise der Fall ist.</li>\n</ul>\n</div>\n\n\n</div>\n\n</div>\n\n<div class="footer">\n    <div class="footer-content">\n\n\n        <div>\n            <a href="https://www.immobilienscout24.de/unternehmen.html">\xc3\x9cber uns</a> |\n            <a href="https://www.immobilienscout24.de/kontakt.html">Kontakt & Hilfe</a> |\n            <a href="https://www.immobilienscout24.de/unternehmen/karriere/">Karriere</a> |\n            <a href="https://www.immobilienscout24.de/sitemap.html">Sitemap</a> |\n            <a href="https://api.immobilienscout24.de">Developer</a> |\n            <a href="https://www.immobilienscout24.de/unternehmen/mediendienst.html">Presseservice</a> |\n            <a href="https://www.immobilienscout24.de/ratgeber/newsletter.html">Newsletter abonnieren</a> |\n            <a href="https://www.immobilienscout24.de/impressum.html">Impressum</a> |\n            <a href="https://www.immobilienscout24.de/agb.html">AGB\'s & Rechtliche Hinweise</a> |\n            <a href="https://www.immobilienscout24.de/agb/verbraucherinformationen.html">Verbraucherinformationen</a> |\n            <a href="https://www.immobilienscout24.de/agb/datenschutz.html">Datenschutz</a> |\n            <a href="https://www.immobilienscout24.de/lp/Geodatenkodex.html">Datenschutz Kodex f\xc3\xbcr Geodatendienste</a> |\n            <a href="https://sicherheit.immobilienscout24.de">Sicherheit</a>\n        </div>\n        <div>\n            <!--<a href="">Immobiliensuche</a> | -->\n            <a href="https://www.scout24media.com/">Werbung</a> |\n            <a 
href="https://blog.immobilienscout24.de">Blog</a>\n            <!--|\n            <a href="">Nachbarschaft</a> |\n            <a href="">Gratis! E-Mail-Adresse @t-online.de</a>-->\n        </div>\n        <div>\n            <a href="https://www.immobilienscout24.de/">www.ImmobilienScout24.de</a>\n        </div>\n        <div class="legend">\n            \xc2\xa9 Copyright 1999 - 2020 Immobilien Scout GmbH\n        </div>\n    </div>\n\n</div>\n\n</body>\n</html>\n'
Traceback (most recent call last):
  File "flathunter.py", line 85, in <module>
    main()
  File "flathunter.py", line 81, in main
    launch_flat_hunt(config)
  File "flathunter.py", line 41, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 40, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 21, in crawl_for_exposes
    for searcher in self.config.searchers()
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 22, in <listcomp>
    for url in self.config.get('urls', list()) ])
  File "/home/pi/Development/flathunter/flathunter/abstract_crawler.py", line 12, in crawl
    return self.get_results(url, max_pages)
  File "/home/pi/Development/flathunter/flathunter/crawl_immobilienscout.py", line 41, in get_results
    while len(entries) < min(no_of_results, self.RESULT_LIMIT) and (max_pages is None or page_no < max_pages):
UnboundLocalError: local variable 'no_of_results' referenced before assignment

I can still access Immoscout without problems via chromium on the Pi though... 🤔
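
The traceback suggests that `no_of_results` is only assigned when the result page contains the expected element, so the captcha page leaves it unset and the `while` condition fails. The crawler's actual parsing code is not shown in this thread, so the following is only a hypothetical sketch of the pattern with a guard added. The names `RESULT_COUNT_RE`, `parse_result_count` and `pages_to_fetch`, the `data-result-count` attribute and the page size of 20 are all invented for illustration.

```python
import re

# Hypothetical marker for the number of results; the real page structure may differ.
RESULT_COUNT_RE = re.compile(r'data-result-count="(\d+)"')


def parse_result_count(html):
    """Return the announced number of results, or None if the marker is missing."""
    match = RESULT_COUNT_RE.search(html)
    return int(match.group(1)) if match else None


def pages_to_fetch(html, result_limit=50, per_page=20):
    """Return how many result pages to crawl; 0 if the page has no result count."""
    no_of_results = parse_result_count(html)
    if no_of_results is None:
        # For example a captcha page: stop cleanly instead of raising UnboundLocalError.
        return 0
    wanted = min(no_of_results, result_limit)
    return -(-wanted // per_page)  # ceiling division
```

With a guard like this the script would skip the blocked crawl instead of crashing, although it would still return no results until the bot detection itself is dealt with.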

@bauer-jan

I haven't had a deeper look into the implementation of the crawler, but maybe Selenium would help with the bot detection. This is really sad. I was so excited when I came across this tool and wanted to use it for my personal flat hunt ;)

@namnoops

namnoops commented Aug 7, 2020

Same experience here. My first thought was that they block an IP that has made too many requests, but I can access ImmoScout as usual with a browser. I suppose it has to do with the request headers or the lack of cookie and JavaScript support, as was mentioned above.

@pcace

pcace commented Aug 9, 2020

Hmm... so sad :/ same problem here...

Cheers

@choeffer
Author

I have tried to use http://html.python-requests.org/ and https://selenium-python.readthedocs.io/ . But I am still getting the Google captcha thingy on immoscout24.

At least it is fairly easy to replace how the HTML content is retrieved. After digging through the code, I was able to swap out the Python requests package for either of the libraries mentioned above just by changing

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        resp = requests.get(url)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

from https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py

for selenium with Chrome

from selenium import webdriver

...

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            # page_source is a property holding the rendered HTML as a string;
            # Selenium does not expose the HTTP status code, so there is nothing to check here
            html = driver.page_source
        finally:
            driver.quit()
        return BeautifulSoup(html, 'html.parser')

for requests_html

from requests_html import HTMLSession

...

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        session = HTMLSession()
        resp = session.get(url)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

With either change, at least Ebay Kleinanzeigen is still working fine.
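
Since only the retrieval step differs between the three variants above, one way to make the backend swappable is to pass the fetch function in from outside. This is a rough sketch under that assumption and not how flathunter is structured; `HtmlSource` and `fetch_html` are hypothetical names.

```python
class HtmlSource:
    """Hypothetical wrapper around the crawler's HTML retrieval step."""

    def __init__(self, fetch_html):
        # fetch_html: any callable that takes a URL and returns the page HTML as a string,
        # for example a thin wrapper around requests, requests_html or Selenium.
        self._fetch_html = fetch_html

    def get_html(self, url):
        """Fetch the page and fail loudly if the backend returned nothing."""
        html = self._fetch_html(url)
        if not html:
            raise ValueError("No HTML returned for %s" % url)
        return html
```

The string returned by `get_html` could then be handed to `BeautifulSoup(html, 'html.parser')` exactly as in the snippets above, whichever backend produced it.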

@choeffer
Author

With the help of https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth and the code

// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')
const fs = require("fs");

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: false }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/koeln/wohnung-mieten?sorting=2')
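  // give the page 5 seconds to finish loading (page.waitFor is deprecated in newer Puppeteer releases)
  await new Promise(resolve => setTimeout(resolve, 5000))
  const html = await page.content();
  fs.writeFileSync("index.html", html);
  // await page.screenshot({ path: 'testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check index.html. ✨`)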
})

I was able to bypass the bot protection. Right now this is more of a proof of concept: the website loads fine at first but keeps loading until only an ad is shown as the final content. Still, this could be a starting point for bypassing the immoscout24 bot protection.

@lomoien

lomoien commented Aug 17, 2020

Too bad, I'm having the same issue and can't run flathunter on ImmoScout.

@mordax7 mordax7 linked a pull request Aug 17, 2020 that will close this issue
@mordax7

mordax7 commented Aug 17, 2020

Just merged a fix, use the latest code from the main branch.

Please let me know if it works now.

@lomoien

lomoien commented Aug 17, 2020

> Just merged a fix, use the latest code from the main branch.
>
> Please let me know if it works now.

Thank you! Seems to run fine now. Do you know how I can check whether the program loops after 5 minutes? For me nothing happens at the moment after I wait for the loop time configured in the config file.

@choeffer
Author

It works now for me. Thanks for the patch.

@mordax7

mordax7 commented Aug 17, 2020

> Just merged a fix, use the latest code from the main branch.
> Please let me know if it works now.
>
> Thank you! Seems to run fine now. Do you know how I can check whether the program loops after 5 minutes? For me nothing happens at the moment after I wait for the loop time configured in the config file.

Set the logging to verbose and check the output. I guess this is related to issue #50? Let's move the discussion there.

> It works now for me. Thanks for the patch.

Ok, closing the ticket.

@mordax7 mordax7 closed this as completed Aug 17, 2020
@choeffer
Author

This does not seem to be solved. It worked properly a few times, but now I can see new offers on immoscout24 via Firefox which are not listed by flathunter. Maybe the response status is still 200, so everything appears to work, but I do not think the requested content is actually delivered.

@choeffer
Author

A print() of the HTML content reveals that the output is the same as the file.log from the first post.
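
Because the block page can apparently arrive with status 200, a status check alone will not catch it. A possible approach is to look for markers of the captcha page in the body. The sketch below assumes the two markers seen in the page dump earlier in this thread (`g-recaptcha` and `showBlockPage(`) remain stable, which is not guaranteed; `BLOCK_PAGE_MARKERS` and `looks_like_block_page` are hypothetical names.

```python
# Markers taken from the blocked-page dump posted earlier in this thread.
BLOCK_PAGE_MARKERS = ("g-recaptcha", "showBlockPage(")


def looks_like_block_page(html):
    """Return True if the response body looks like the captcha/block page."""
    if isinstance(html, bytes):
        # requests exposes the raw body as bytes (resp.content); decode it first.
        html = html.decode("utf-8", errors="replace")
    return any(marker in html for marker in BLOCK_PAGE_MARKERS)
```

A crawler could call this on `resp.content` and log a clear "blocked by bot detection" message instead of silently reporting zero new listings.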

@mordax7

mordax7 commented Aug 17, 2020

Yes, they just rolled out a new version. It seems like they added cookies to their headers. It's a separate issue, so I created a follow-up: #51
