Google sometimes serves a legacy page meant for bots and/or old browsers - this breaks the search controller and results in a TimeoutException. #11

Closed
pmnlla opened this issue Feb 18, 2023 · 4 comments

Comments

@pmnlla

pmnlla commented Feb 18, 2023

In certain circumstances, Google serves an older search page without the appbar element. I have no clue what triggers it but I've managed to get it to appear consistently on Oracle Cloud Infrastructure IPs, even with desktop user agents.

The issue with this page is that the search controller waits for an element that is not present on it (see the wait in search_controller.py in the traceback below). Ads might still appear and may be treated as legitimate, since this page is served to older browsers like IE11, but the wait ends in a TimeoutException even if the page is fully loaded and ads are present.

Attached are a stack trace that I triggered manually after the timeout error and a screenshot of what the legacy search page looks like.

The two pages are certainly structured differently, with different class names, but I don't doubt that it should be trivial to figure out where the ads are on the legacy page, if Google does decide to serve me some now that my adblocker is off.

Traceback (most recent call last):
  File "/home/ubuntu/ad_clicker/search_controller.py", line 74, in search_for_ads
    results_loaded = wait.until(EC.visibility_of_element_located(self.RESULTS_CONTAINER))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/selenium/webdriver/support/wait.py", line 90, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
#0 0x55d55f4fa2a3 <unknown>
#1 0x55d55f2b8f77 <unknown>
#2 0x55d55f2f580c <unknown>
#3 0x55d55f2f5a71 <unknown>
#4 0x55d55f32f734 <unknown>
#5 0x55d55f315b5d <unknown>
#6 0x55d55f32d47c <unknown>
#7 0x55d55f315903 <unknown>
#8 0x55d55f2e8ece <unknown>
#9 0x55d55f2e9fde <unknown>
#10 0x55d55f54a63e <unknown>
#11 0x55d55f54db79 <unknown>
#12 0x55d55f53089e <unknown>
#13 0x55d55f54ea83 <unknown>
#14 0x55d55f523505 <unknown>
#15 0x55d55f56fca8 <unknown>
#16 0x55d55f56fe36 <unknown>
#17 0x55d55f58b333 <unknown>
#18 0x7f697f916b43 <unknown>

18-02-2023 16:00:08 [ INFO]  83: No ads in the search results!

[image: screenshot of the legacy search page]
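
A minimal sketch of how the legacy page could be told apart from a slow load, so that a timeout on the results container is not the only signal. It assumes the appbar element mentioned above carries `id="appbar"`; the helper names `_IdCollector` and `is_legacy_page` are hypothetical and not part of the project.

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects the id attribute of every element in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def is_legacy_page(page_source, marker_id="appbar"):
    """Return True when no element with the given id is present.

    Assumption: the modern results page contains an element with
    id == marker_id and the legacy page does not.
    """
    collector = _IdCollector()
    collector.feed(page_source)
    return marker_id not in collector.ids
```

With Selenium, the input would be `driver.page_source`. A caller could run this check after the wait times out and log that the legacy page was served instead of reporting a plain timeout.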

@pmnlla
Author

pmnlla commented Feb 18, 2023

I figured I should attach my Chrome options.

    chrome_options = ChromeOptions()
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-infobars")
    chrome_options.add_argument("--disable-popup-blocking")
    chrome_options.add_argument("--disable-notifications")
    chrome_options.add_argument("--ignore-ssl-errors")
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_argument(f"--user-agent={user_agent_str}")
    chrome_options.add_argument("--user-data-dir=/tmp/chrome")
    chrome_options.add_argument("--disable-setuid-sandbox")
    chrome_options.add_argument("--disable-application-cache")
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument("--enable-javascript")
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    
    # Only if running headless:
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--window-size=1920,1080")
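
The same flags can be collected in one place, which makes the headless-only part explicit. This is only a sketch: `build_chrome_args` is a hypothetical helper, and the `headless` parameter is an assumption standing in for however the script decides it is running headless.

```python
def build_chrome_args(user_agent, headless=False):
    """Return the Chrome command-line arguments as a list of strings."""
    args = [
        "--disable-gpu",
        "--disable-dev-shm-usage",
        "--disable-infobars",
        "--disable-popup-blocking",
        "--disable-notifications",
        "--ignore-ssl-errors",
        "--start-maximized",
        f"--user-agent={user_agent}",
        "--user-data-dir=/tmp/chrome",
        "--disable-setuid-sandbox",
        "--disable-application-cache",
        "--disable-extensions",
        "--enable-javascript",
        "--disable-blink-features=AutomationControlled",
    ]
    if headless:
        # These two flags are only added for headless runs.
        args += ["--no-sandbox", "--window-size=1920,1080"]
    return args
```

Each returned string would then be passed to `chrome_options.add_argument(...)` in a loop, as in the snippet above.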

@pmnlla
Author

pmnlla commented Feb 18, 2023

Update on this: as far as I'm aware, the legacy page no longer serves ads. This means we need to figure out what triggers it, which I found out is some obfuscated fingerprinting code. Unless we can bypass it, mass deployment of this on dedicated instances is likely toast after a while.

I'm waiting for a friend to spin up some VMs for me in her home lab so I can test this on non-Oracle IP ranges.

@coskundeniz
Owner

It is related to the selected user agent string. I updated the code to take the user agent only from the predefined list.
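
As a rough illustration of picking the user agent only from a predefined list (not the actual commit), something like the following would do. The list entries are placeholder examples, and `PREDEFINED_USER_AGENTS` and `pick_user_agent` are hypothetical names; the project's real list is not shown in this thread.

```python
import random

# Placeholder entries for illustration only (assumption, not the project's list).
PREDEFINED_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
]


def pick_user_agent(candidates=PREDEFINED_USER_AGENTS, rng=random):
    """Return one user agent string chosen at random from the given list."""
    if not candidates:
        raise ValueError("the user agent list must not be empty")
    return rng.choice(candidates)
```

The result would be passed to the `--user-agent=` option, so that no arbitrary or generated string reaches the browser.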

@pmnlla
Author

pmnlla commented Feb 19, 2023

Got it. Pulled the latest commit, everything seems to work now. Thanks a lot!
