Trello: join the kanban board for tracking websites.
A project to extract GeoJSON from the web, focusing on websites that have 'store locator' pages, like restaurants, gas stations, retailers, etc. Each chain has its own bit of software (a "spider") to extract useful information from its site. Each spider can be individually configured to throttle its request rate so that it acts as a good citizen on the Internet.
The project is built using `scrapy`, a Python-based web scraping framework. Each target website gets its own spider, which does the work of extracting interesting details about locations and outputting results in a useful format.
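The per-spider throttling mentioned above is typically done with Scrapy's standard `custom_settings` mechanism. Below is a minimal sketch; the spider name, URL, and values are illustrative assumptions, not taken from this repository.

```python
import scrapy


class PoliteChainSpider(scrapy.Spider):
    # Hypothetical spider, used only to illustrate per-spider throttling.
    name = "polite_chain"
    start_urls = ["https://example.com/store-locator/"]

    # custom_settings is standard Scrapy: these values apply to this
    # spider only, leaving other spiders in the project unaffected.
    custom_settings = {
        "DOWNLOAD_DELAY": 2.0,       # wait 2 seconds between requests
        "CONCURRENT_REQUESTS": 1,    # one request in flight at a time
        "ROBOTSTXT_OBEY": True,      # respect the site's robots.txt
    }

    def parse(self, response):
        pass  # extraction logic goes here
```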
To scrape a new website for locations, you'll want to create a new spider. You can copy from an existing spider or start from a blank one, but the result is always a Python class with a `parse()` function that `yield`s `GeojsonPointItem`s. The Scrapy framework does the work of outputting the GeoJSON based on the objects that the spider generates.
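As a concrete example, a minimal spider might look like the sketch below. The chain, URL, CSS selectors, the import path for `GeojsonPointItem`, and the field names (`ref`, `name`, `addr_full`, `lat`, `lon`) are all assumptions for illustration; check the project's item definition for the real fields.

```python
import scrapy

# Assumed import path for the project's item class; adjust to match
# wherever GeojsonPointItem is actually defined in this repository.
from locations.items import GeojsonPointItem


class ExampleChainSpider(scrapy.Spider):
    # Hypothetical chain and URL, for illustration only.
    name = "example_chain"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/store-locator/all"]

    def parse(self, response):
        # Assumes each location is rendered as a <div class="store"> block
        # with data attributes; real sites need their own selectors.
        for store in response.css("div.store"):
            yield GeojsonPointItem(
                ref=store.attrib.get("data-store-id"),
                name=store.css(".store-name::text").get(),
                addr_full=store.css(".store-address::text").get(),
                lat=store.attrib.get("data-lat"),
                lon=store.attrib.get("data-lon"),
            )
```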
To get started, you'll want to install the dependencies for this project.
- This project uses `pipenv` to handle dependencies and virtual environments. To get started, make sure you have `pipenv` installed.
- With `pipenv` installed, make sure you have the `places-spider` repository checked out:

  ```
  git clone https://gitlab.com/geo-spider/places-spider.git
  ```

- Then you can install the dependencies for the project:

  ```
  cd places-spider
  pipenv install
  ```

- After the dependencies are installed, make sure you can run the `scrapy` command without error:

  ```
  pipenv run scrapy
  ```

- If `pipenv run scrapy` ran without complaining, then you have a functional `scrapy` setup and are ready to write a scraper.
Once your spider is written, you can run it and write its output to a GeoJSON file. For example, to run the `avoska_dac` spider:

```
pipenv run scrapy crawl avoska_dac --output=avoska.geojson
```
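To sanity-check the result, you can load the file with Python. This sketch assumes the project's exporter writes a standard GeoJSON FeatureCollection, which may not match the actual output layout.

```python
import json

# Assumes avoska.geojson is a standard GeoJSON FeatureCollection;
# adjust if the project's exporter emits a different layout.
with open("avoska.geojson") as f:
    collection = json.load(f)

print(f"{len(collection['features'])} locations scraped")
print(collection["features"][0]["properties"])  # inspect one location
```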