Trello: join the kanban board for tracking websites.
A project to extract GeoJSON from the web, focusing on websites that have 'store locator' pages, like restaurants, gas stations, retailers, etc. Each chain has its own bit of software (a "spider") to extract useful information from its site. Each spider can be individually configured to throttle its request rate so that it acts as a good citizen on the Internet.
The project is built using `scrapy`, a Python-based web scraping framework. Each target website gets its own spider, which does the work of extracting interesting details about locations and outputting results in a useful format.
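The per-spider throttling mentioned above is typically done with Scrapy's standard `custom_settings` mechanism. Below is a minimal sketch; the spider name, URL, and values are illustrative assumptions, not taken from this repository.

```python
import scrapy


class PoliteChainSpider(scrapy.Spider):
    # Hypothetical spider, used only to illustrate per-spider throttling.
    name = "polite_chain"
    start_urls = ["https://example.com/store-locator/"]

    # custom_settings is standard Scrapy: these values apply to this
    # spider only, leaving other spiders in the project unaffected.
    custom_settings = {
        "DOWNLOAD_DELAY": 2.0,       # wait 2 seconds between requests
        "CONCURRENT_REQUESTS": 1,    # one request in flight at a time
        "ROBOTSTXT_OBEY": True,      # respect the site's robots.txt
    }

    def parse(self, response):
        pass  # extraction logic goes here
```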
To scrape a new website for locations, you'll want to create a new spider. You can copy from an existing spider or start from a blank one, but the result is always a Python class with a `parse()` function that `yield`s `GeojsonPointItem`s. The Scrapy framework does the work of outputting the GeoJSON based on the objects that the spider generates.
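As a concrete example, a minimal spider might look like the sketch below. The chain, URL, CSS selectors, the import path for `GeojsonPointItem`, and the field names (`ref`, `name`, `addr_full`, `lat`, `lon`) are all assumptions for illustration; check the project's item definition for the real fields.

```python
import scrapy

# Assumed import path for the project's item class; adjust to match
# wherever GeojsonPointItem is actually defined in this repository.
from locations.items import GeojsonPointItem


class ExampleChainSpider(scrapy.Spider):
    # Hypothetical chain and URL, for illustration only.
    name = "example_chain"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/store-locator/all"]

    def parse(self, response):
        # Assumes each location is rendered as a <div class="store"> block
        # with data attributes; real sites need their own selectors.
        for store in response.css("div.store"):
            yield GeojsonPointItem(
                ref=store.attrib.get("data-store-id"),
                name=store.css(".store-name::text").get(),
                addr_full=store.css(".store-address::text").get(),
                lat=store.attrib.get("data-lat"),
                lon=store.attrib.get("data-lon"),
            )
```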
To get started, you'll want to install the dependencies for this project.
- This project uses `pipenv` to handle dependencies and virtual environments. To get started, make sure you have `pipenv` installed.
- With `pipenv` installed, make sure you have the `places-spider` repository checked out:

  ```
  git clone https://gitlab.com/geo-spider/places-spider.git
  ```

- Then you can install the dependencies for the project:

  ```
  cd places-spider
  pipenv install
  ```

- After the dependencies are installed, make sure you can run the `scrapy` command without error:

  ```
  pipenv run scrapy
  ```

- If `pipenv run scrapy` ran without complaining, then you have a functional `scrapy` setup and are ready to write a scraper.
Once your spider is written, you can run it and write its output to a GeoJSON file. For example, to run the `avoska_dac` spider:

```
pipenv run scrapy crawl avoska_dac --output=avoska.geojson
```
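To sanity-check the result, you can load the file with Python. This sketch assumes the project's exporter writes a standard GeoJSON FeatureCollection, which may not match the actual output layout.

```python
import json

# Assumes avoska.geojson is a standard GeoJSON FeatureCollection;
# adjust if the project's exporter emits a different layout.
with open("avoska.geojson") as f:
    collection = json.load(f)

print(f"{len(collection['features'])} locations scraped")
print(collection["features"][0]["properties"])  # inspect one location
```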