An API for selecting part of a document on the web based on a path to the content.
- Documentation: https://parsel-selector-api.herokuapp.com/redoc
- Swagger Documentation: https://parsel-selector-api.herokuapp.com/docs
Select these links for cool information about the world, powered by this API:
- The number of humans currently in space - Source Data
- Title of the current top post on all of Reddit - Source Data
- Cover of Amazon's best selling book - Source Data
How it works: Users pass the API a URL to a document on the web and a path to particular content on that page. The page is scraped, and the requested data is returned!
In the Scrapy Python project, a framework for web scraping, the Parsel library is used to parse scraped content and extract the data the scraper wants. Getting page content looks something like this:
```python
>>> fetch("https://old.reddit.com/")
>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'
```
The above example shows how a developer might work out the correct XPath needed to reach the content they want; this typically involves a lot of trial and error.
In another project, I created a GUI that performs this task within a browser, and it works quite nicely. The goal of this project is to create an API that does the same thing and more.
This API serves two purposes:
- A standalone API where a user can get specific page content with a path, a useful tool for all sorts of projects.
- An API that can serve as the backend for a static website built as a tool to assist Scrapy users.
- Parse HTML with Xpath or CSS selectors as you would in Scrapy/Parsel.
- Parse JSON and XML with a path similar to an Xpath.
- Parse any text content on the internet with a Regex pattern.
- Test out how the site you're working on reacts to different User-Agents.
- Built with FastAPI, which provides Swagger and ReDoc documentation automatically.
- Caching on unique url/user_agent combinations when the request's status_code is 200, preventing the API from calling an endpoint too frequently.
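The caching behavior above can be sketched as a simple in-memory cache keyed on the url/user_agent pair. This is a hypothetical illustration of the idea, not the project's actual implementation:

```python
# Sketch of caching keyed on (url, user_agent). Hypothetical illustration
# only; the project's real cache may differ.
_cache = {}

def fetch_with_cache(url, user_agent, fetch):
    """Return a cached response for this url/user_agent combo when possible.

    `fetch` is any callable returning an object with a .status_code attribute;
    only successful (200) responses are stored in the cache.
    """
    key = (url, user_agent)
    if key in _cache:
        return _cache[key]
    response = fetch(url, user_agent)
    if response.status_code == 200:  # only cache successful requests
        _cache[key] = response
    return response
```

Keying on both the URL and the User-Agent matters because the same page can return different content for different agents.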
You can clone this repo for your own hosted version, or you can use the hosted version at https://parsel-selector-api.herokuapp.com/docs
```shell
# Clone repo
git clone https://github.com/avi-perl/Parsel-Selector-API.git
cd Parsel-Selector-API

# Install requirements
pip install -r requirements.txt

# Run the app
uvicorn app.main:app --reload
```
Additional examples can be found in the examples folder.
```python
import requests

# Example using the default BASIC return style
params = {
    "url": "https://parsel-selector-api.herokuapp.com/examples/html",
    "path": "/html/body/div/span[3]/text()",
    "path_type": "XPATH",
}
r = requests.get("https://parsel-selector-api.herokuapp.com/parsel", params=params)
print(r.json())
```
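For the other path types, only the `path` and `path_type` parameters change. For example, a CSS request might be encoded like this (the selector string below is a hypothetical example, and the URL encoding is shown with the standard library for clarity):

```python
from urllib.parse import urlencode

# Same endpoint, but selecting with a CSS selector instead of an XPath.
# The selector string here is a hypothetical example.
params = {
    "url": "https://parsel-selector-api.herokuapp.com/examples/html",
    "path": "body div span::text",
    "path_type": "CSS",
}
request_url = "https://parsel-selector-api.herokuapp.com/parsel?" + urlencode(params)
```

The resulting URL can be opened directly in a browser, which is handy when iterating on a selector.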
Select the links below for documentation on how to structure your path for each type, based on the libraries used to power it.
Type | Library Used | Notes |
---|---|---|
XPATH | Parsel | Currently only supporting the `.get()` method. |
CSS | Parsel | Currently only supporting the `.get()` method. |
REGEX | Parsel | |
JSON | dpath | |
XML | xmltodict + dpath | XML is converted to a dictionary by xmltodict, then parsed as JSON is with dpath. |
This project has been mostly about learning; your pull requests and comments would be greatly appreciated!
- Add request cache so that the same URL is not called frequently.
- Add more tests on basic functionality.
- Create a front-end as a GUI for this tool.
- Add path parsing errors to the response for types other than XML.