An Elixir-based crawler using Hound and Floki, with a webdriver. This scraper ran well with Elixir 1.4.1 and Erlang/OTP 25, on Arch Linux.
Since we need to interact with the page a little, you'll need to install a webdriver.
- Download and install a chromedriver here (make sure it supports your Google Chrome version): Versions 115 & later, Older Versions
- Run in terminal:
chromedriver
- Download and install a Selenium webdriver here (make sure you have Java installed): Selenium Driver
- Run the webdriver with
java -jar selenium-server-standalone-3.9.1.jar
- Run in terminal:
mix deps.get
- Run in terminal:
iex -S mix run -e "Burpple.Hound.run" (for Burpple)
iex -S mix run -e "Lemon8.Hound.run" (for Lemon8)
iex -S mix run -e "Explorest.Hound.run" (for Explorest)
iex -S mix run -e "GMaps.Hound.run" (for Google Maps)
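For orientation, here is a minimal sketch of how a `run` function could combine Hound and Floki. It is not the code of any of the scrapers above: the module name `ExampleScraper`, the default URL, and the `h1` selector are all placeholders chosen for illustration, and it assumes a webdriver is already running and `:hound` is configured.

```elixir
defmodule ExampleScraper do
  # Hypothetical module for illustration; not one of the project's scrapers.
  use Hound.Helpers

  # Opens a browser session, loads the page, and extracts heading text.
  # Assumes a webdriver is running and :hound is configured.
  def run(url \\ "https://example.com") do
    Hound.start_session()

    try do
      navigate_to(url)
      page_source() |> titles()
    after
      # Always close the browser session, even if scraping fails.
      Hound.end_session()
    end
  end

  # Parses an HTML string with Floki and returns the text of every <h1>.
  def titles(html) do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find("h1")
    |> Enum.map(&Floki.text/1)
  end
end
```

Hound drives the browser so that dynamically loaded content is present in the page source, and Floki then parses that source as static HTML.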
For Burpple, the Hound config should be: config :hound, driver: "chrome_driver", browser: "chrome_headless", server: true
For Lemon8 or Explorest, the Hound config should be: config :hound, host: "http://localhost", port: 4444, path_prefix: "wd/hub/"
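Since the two configurations differ, one possible way to avoid editing the config file between runs is to select the block from an environment variable. This is only a sketch: the `SCRAPER` variable is a made-up name, not something the project defines, and `import Config` assumes Elixir 1.9 or later. The README does not state a config for Google Maps, so that case is left on Hound's defaults here.

```elixir
# config/config.exs
import Config

# SCRAPER is a hypothetical environment variable used only in this sketch.
case System.get_env("SCRAPER", "burpple") do
  "burpple" ->
    # ChromeDriver setup, as given above for Burpple.
    config :hound, driver: "chrome_driver", browser: "chrome_headless", server: true

  scraper when scraper in ["lemon8", "explorest"] ->
    # Selenium server setup, as given above for Lemon8 and Explorest.
    config :hound, host: "http://localhost", port: 4444, path_prefix: "wd/hub/"

  _other ->
    # No config is documented for other scrapers; keep Hound's defaults.
    :ok
end
```

With this in place, a run might look like `SCRAPER=lemon8 iex -S mix run -e "Lemon8.Hound.run"`. The config is evaluated at compile time, so the project is recompiled when the value changes.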
Each scraper is configured through module attributes.
For Burpple:
@neighbourhood: Sets the neighbourhood to be scraped. Ensure that it's available on Burpple.
@limit: Sets when to stop scraping (default 4000)
@file_path: Sets the file path to save data in
For Lemon8:
@url: Sets the URL to be scraped; it currently points to the #singaporefood page of Lemon8.
@file_path: Sets the file path to save data in
For Explorest:
@neighbourhood: Sets the neighbourhood to be scraped. Ensure that it's available on Explorest.
@file_path: Sets the file path to save data in
For Google Maps:
@search_term: Sets the search term to input into Google Maps
@file_path: Sets the file path to save data in
@limit: Sets when to stop scraping (default 100)
@panels_skip: Sets the number of panels to skip (to avoid clicking ads; see photo below)
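As an illustration of how such attributes are typically declared and used, here is a small sketch for the Google Maps settings. The module name and all values except the default limit of 100 are hypothetical. The `select_panels/1` function reflects one assumed reading of `@panels_skip` and `@limit` (skip the first few result panels, then keep at most the limit); the real scraper may apply them differently.

```elixir
defmodule GMapsSettingsSketch do
  # Hypothetical values; only the default limit of 100 comes from the README.
  @search_term "cafes in Singapore"
  @file_path "data/gmaps.csv"
  @limit 100
  @panels_skip 2

  # Returns the compile-time settings as a map, e.g. for logging.
  def settings do
    %{
      search_term: @search_term,
      file_path: @file_path,
      limit: @limit,
      panels_skip: @panels_skip
    }
  end

  # Assumed interpretation: drop the first @panels_skip result panels
  # (where ads may appear), then keep at most @limit of the rest.
  def select_panels(panels) do
    panels
    |> Enum.drop(@panels_skip)
    |> Enum.take(@limit)
  end
end
```

Module attributes are fixed at compile time, so changing a value means editing the source file and recompiling before the next run.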
