A Python wrapper for Page.REST
Page.REST is an HTTP API created by Lakshan Perera that can extract content from any web page as JSON.
This wrapper makes it easier to access Page.REST using Python. It also makes it easier to request data about multiple URLs.
You'll need to buy an access token for Page.REST. Tokens cost $5 and are valid for 365 days. There's a daily cap of 100,000 requests per token.
Pypagerest requires Python 3.5 or higher.
The package isn't on PyPI yet. You can install the live pypagerest code from GitHub using pipenv (for a relatively straightforward virtual environment setup) or alternatively with pip/pip3.
pipenv install -e git+https://github.com/edjw/pypagerest#egg=pypagerest
pip install git+https://github.com/edjw/pypagerest
pip3 install git+https://github.com/edjw/pypagerest
You have to set some variables in the file that imports pypagerest. Pypagerest will use these variables to retrieve the data you want.
import pypagerest
# Insert your Page.REST access token here
pr_token = "Page.Rest_Access_token"
# Insert URLs here, inside square brackets and in quotes.
# If there's more than one URL, separate them with commas.
# Use square brackets even if there's only one URL. I'm going to fix this.
urls = ["https://domain.tld"] # One URL
*OR*
urls = ["https://domain.tld", "https://anotherdomain.tld"] # More than one URL
# If you want to extract content using CSS selectors, put the selectors inside square brackets like this
selectors = [".class_one", ".class_two", "#id_one", "#id_two", "h1", "p"]
# If you want to extract HTTP response headers, put the response headers inside square brackets like this
headers = ["X-Frame-Options", "X-XSS-Protection", "Content-Security-Policy"]
Then call whichever of these functions provides the functionality you want, as described on https://page.rest
Grab site title, description, logo, favicons, canonical URL, status code, and Twitter handle as described at https://page.rest/#basic
pypagerest.get_pr_basic(pr_token, urls)
Use CSS selectors to retrieve content from matching elements. You can use up to 10 selector queries as described at https://page.rest/#selector-queries
pypagerest.get_pr_selector(pr_token, urls, selectors)
Extract content from pages that render on client-side using JavaScript as described at https://page.rest/#prerender
pypagerest.get_pr_prerender(pr_token, urls, selectors)
Get the oEmbed content for the page as part of the response (only if available) as described at https://page.rest/#embed-content
pypagerest.get_pr_oembed(pr_token, urls)
Get the OpenGraph content for the page as part of the response (only if available) as described at https://page.rest/#open-graph
pypagerest.get_pr_opengraph(pr_token, urls)
Get the HTTP response headers you listed in the headers variable as part of the response
pypagerest.get_pr_responseheaders(pr_token, urls, headers)
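Each function returns the decoded JSON from Page.REST. As a quick sketch of what post-processing might look like, here is a hypothetical example; the `url`, `title` and `status` field names are assumptions modelled on the basic endpoint's description, not verified output from the live API:

```python
# Hypothetical responses shaped like the basic endpoint's output.
# The "url", "title" and "status" keys are assumptions, not
# verified against the live Page.REST API.
responses = [
    {"url": "https://domain.tld", "title": "Example Domain", "status": 200},
    {"url": "https://anotherdomain.tld", "title": "Missing Page", "status": 404},
]

# Keep only pages that resolved successfully, then map URL -> title.
ok_pages = [page for page in responses if page.get("status") == 200]
titles = {page["url"]: page["title"] for page in ok_pages}
```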
If you request multiple URLs, the output is a Python list of the JSON responses from Page.REST.
If you request only a single URL, the output is the JSON response itself, not wrapped in a list. Let me know if you can think of a more useful way of returning the data for real-world use.
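If the mixed return type is awkward, a tiny wrapper of your own (not part of pypagerest) can normalise it so calling code always gets a list:

```python
def as_list(result):
    """Wrap a single JSON response (a dict) in a list; pass lists through."""
    return result if isinstance(result, list) else [result]

# Works the same whether one URL or several were requested:
# for page in as_list(pypagerest.get_pr_basic(pr_token, urls)):
#     ...
```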
I have not tested this on any version of Python except 3.6.
The json module is necessary because printing the dictionaries that requests decodes puts single quote marks ' around keys, which I found breaks JSON formatting where there are single quote marks in the values, e.g. apostrophes in the titles of web pages. Let me know if you can see a way of removing the json dependency.
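The quoting problem can be reproduced with nothing but the standard library: converting a dict to a string uses Python's repr, which is not valid JSON, while json.dumps always emits double-quoted keys:

```python
import json

data = {"title": "It's a title"}

python_repr = str(data)        # Python repr: single-quoted keys, not valid JSON
valid_json = json.dumps(data)  # proper JSON: double-quoted keys, escaped as needed

# json.loads accepts the json.dumps output but rejects the repr.
round_tripped = json.loads(valid_json)
```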
I am new to Python and would appreciate any suggestions for how to improve pypagerest. If you want to contribute, please submit a pull request! :-)
- Fork this repository
- Create your feature branch
git checkout -b my-new-feature
- Commit your changes
git commit -am 'Add some feature'
- Push to the branch
git push origin my-new-feature
- Create a new Pull Request