Skip to content

fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter#3043

Merged
ZanSara merged 20 commits intodeepset-ai:mainfrom
danielbichuetti:update_crawler_webdriver_options
Aug 22, 2022
Merged

fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter#3043
ZanSara merged 20 commits intodeepset-ai:mainfrom
danielbichuetti:update_crawler_webdriver_options

Conversation

@danielbichuetti
Copy link
Contributor

@danielbichuetti danielbichuetti commented Aug 15, 2022

Related Issues

Proposed Changes:

  • Add a webdriver_options parameter, so the user can set the parameters that his specific scenario demands
  • Automatically set some Webdriver options in Azure ML and Google Colab to avoid crashes
    • --disable-dev-shm-usage
  • Automatically set some Webdriver options in scenario where Chrome will run as root to avoid crashes
  • --no-sandbox
  • Change default parameters:
    • --disable-gpu There is no need to enable GPU acceleration on a text crawler. GPU is expensive in the cloud.
    • --disable-dev-shm-usage Usually in container environments there is no access to shared memory, or it's set with the default size of 64 MB. Disabling its usage, Chrome will write to a temporary directory. Using shared memory may improve performance just in high testing workloads, which is not our use case.
      - --single-process As haystack doesn't support multi-tabs, there is no point to let Chrome spawn multiple processes. It will just increase the memory footprint.

How did you test it?

  • Unit tests
  • Manual tests based on python:3.9-slim-bullseye container image
  • Manual tests inside Azure ML
  • Manual tests inside haystack Dockerfiles

Notes for the reviewer

This changes essentially will make haystack Crawler node able to run in multiple environments without any user extra effort. In case any user has more specific needs, it will be possible using the parameter.

Checklist

@danielbichuetti danielbichuetti requested review from a team as code owners August 15, 2022 17:12
@danielbichuetti danielbichuetti changed the title [[BUG] Update crawler webdriver options [BUG] Update crawler webdriver options Aug 15, 2022
@danielbichuetti danielbichuetti changed the title [BUG] Update crawler webdriver options bug: update crawler webdriver options Aug 15, 2022
Copy link
Contributor

@agnieszka-m agnieszka-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I added some minor language comments to comply with our writing guidelines. Thanks!

Copy link
Contributor

@agnieszka-m agnieszka-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@danielbichuetti danielbichuetti changed the title bug: update crawler webdriver options fix: update crawler webdriver options Aug 16, 2022
@danielbichuetti danielbichuetti requested a review from sjrl August 18, 2022 18:56
@sjrl
Copy link
Contributor

sjrl commented Aug 19, 2022

@danielbichuetti This is looking good to me! Although I think we should have someone from the core-engineering team for the final approval.

@danielbichuetti danielbichuetti changed the title fix: update crawler webdriver options bug: update crawler webdriver options Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: update crawler webdriver options bug: fix Crawler on restricted environment containers and add Webdriver options as parameter Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: fix Crawler on restricted environment containers and add Webdriver options as parameter bug: fix Crawler on restricted environment containers and add ChromeDriver options as parameter Aug 19, 2022
@danielbichuetti danielbichuetti changed the title bug: fix Crawler on restricted environment containers and add ChromeDriver options as parameter fix: update ChromeDriver options on restricted environments and add ChromeDriver options as function parameter Aug 19, 2022
Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@ZanSara ZanSara merged commit d715d02 into deepset-ai:main Aug 22, 2022
@danielbichuetti danielbichuetti deleted the update_crawler_webdriver_options branch August 22, 2022 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crawler is crashing in some container environments

5 participants