A Python wrapper for Page.REST
Page.REST is an HTTP API created by Lakshan Perera that can extract content from any web page as JSON.
This wrapper makes it easier to access Page.REST using Python. It also makes it easier to request data about multiple URLs.
You'll need to buy an access token for Page.REST. Tokens cost $5 and are valid for 365 days. There's a daily cap of 100,000 requests per token.
Pypagerest requires Python 3.5 or higher.
The package isn't on PyPI yet. You can install the live pypagerest code from GitHub using pipenv (for a relatively straightforward virtual environment setup) or alternatively with pip/pip3.
pipenv install -e git+https://github.com/edjw/pypagerest#egg=pypagerest
pip install git+https://github.com/edjw/pypagerest
pip3 install git+https://github.com/edjw/pypagerest
You have to set some variables in the file that imports pypagerest. Pypagerest will use these variables to retrieve the data you want.
import pypagerest
# Insert your Page.REST access token here
pr_token = "Page.Rest_Access_token"
# Insert URLs here, inside square brackets and in quotes.
# If there's more than one URL, separate them with commas.
# Use square brackets even if there's only one URL. I'm going to fix this.
urls = ["https://domain.tld"] # One URL
*OR*
urls = ["https://domain.tld", "https://anotherdomain.tld"] # More than one URL
# If you want to extract content using CSS selectors, put the selectors inside square brackets like this
selectors = [".class_one", ".class_two", "#id_one", "#id_two", "h1", "p"]
# If you want to extract HTTP response headers, put the response headers inside square brackets like this
headers = ["X-Frame-Options", "X-XSS-Protection", "Content-Security-Policy"]
Then call whichever of these functions provides the functionality you want, as described on https://page.rest
Grab site title, description, logo, favicons, canonical URL, status code, and Twitter handle as described at https://page.rest/#basic
pypagerest.get_pr_basic(pr_token, urls)
Use CSS selectors to retrieve content from matching elements. You can use up to 10 selector queries as described at https://page.rest/#selector-queries
pypagerest.get_pr_selector(pr_token, urls, selectors)
Extract content from pages that render on client-side using JavaScript as described at https://page.rest/#prerender
pypagerest.get_pr_prerender(pr_token, urls, selectors)
Get the oEmbed content for the page as part of the response (only if available) as described at https://page.rest/#embed-content
pypagerest.get_pr_oembed(pr_token, urls)
Get the OpenGraph content for the page as part of the response (only if available) as described at https://page.rest/#open-graph
pypagerest.get_pr_opengraph(pr_token, urls)
Get the HTTP response headers you listed in the headers variable as part of the response
pypagerest.get_pr_responseheaders(pr_token, urls, headers)
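Each function returns the decoded JSON from Page.REST. As a quick sketch of what post-processing might look like, here is a hypothetical example; the `url`, `title` and `status` field names are assumptions modelled on the basic endpoint's description, not verified output from the live API:

```python
# Hypothetical responses shaped like the basic endpoint's output.
# The "url", "title" and "status" keys are assumptions, not
# verified against the live Page.REST API.
responses = [
    {"url": "https://domain.tld", "title": "Example Domain", "status": 200},
    {"url": "https://anotherdomain.tld", "title": "Missing Page", "status": 404},
]

# Keep only pages that resolved successfully, then map URL -> title.
ok_pages = [page for page in responses if page.get("status") == 200]
titles = {page["url"]: page["title"] for page in ok_pages}
```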
If you request multiple URLs, the output is a Python list of the JSON responses from Page.REST.
If you request only a single URL, the output is the JSON response itself, not wrapped in a list. Let me know if you can think of a more useful way of returning the data for real-world use.
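If the mixed return type is awkward, a tiny wrapper of your own (not part of pypagerest) can normalise it so calling code always gets a list:

```python
def as_list(result):
    """Wrap a single JSON response (a dict) in a list; pass lists through."""
    return result if isinstance(result, list) else [result]

# Works the same whether one URL or several were requested:
# for page in as_list(pypagerest.get_pr_basic(pr_token, urls)):
#     ...
```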
I have not tested this on any version of Python except 3.6.
The json module is necessary because printing the dictionaries that requests decodes puts single quote marks ' around keys, which I found breaks JSON formatting where there are single quote marks in the values, e.g. apostrophes in the titles of web pages. Let me know if you can see a way of removing the json dependency.
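The quoting problem can be reproduced with nothing but the standard library: converting a dict to a string uses Python's repr, which is not valid JSON, while json.dumps always emits double-quoted keys:

```python
import json

data = {"title": "It's a title"}

python_repr = str(data)        # Python repr: single-quoted keys, not valid JSON
valid_json = json.dumps(data)  # proper JSON: double-quoted keys, escaped as needed

# json.loads accepts the json.dumps output but rejects the repr.
round_tripped = json.loads(valid_json)
```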
I am new to Python and would appreciate any suggestions for how to improve pypagerest. If you want to contribute, please submit a pull request! :-)
- Fork this repository
- Create your feature branch
git checkout -b my-new-feature
- Commit your changes
git commit -am 'Add some feature'
- Push to the branch
git push origin my-new-feature
- Create a new Pull Request