Extract

Extract just the content from a web page.

Extract is a wrapper to turn the Mercury Parser into a web service.

Why?

Mercury already offers an API component, meant to be deployed to AWS Lambda. There are a few reasons why this exists as an alternative.

  1. Deploy elsewhere. Extract is a vanilla Node.js app that is meant to run in a VM and has no platform-specific dependencies.

  2. Built-in authorization system.

  3. Performance. In my experience, running it on a VM has been faster than the Lambda version.

Here's a graph showing a decrease in average response time around February 17. This is when Feedbin switched from the Lambda-hosted version to Extract running on a VPS.

[Graph: Response Time]

Installation

  1. Install Node.js and npm.

  2. Clone extract.

    git clone https://github.com/feedbin/extract.git
  3. Install the dependencies.

    cd extract
    npm install
  4. Run the server.

    node app/server.js

    Alternatively, extract includes an ecosystem.config.js to use with pm2. You could use this in production.

    npm install --global pm2
    pm2 start ecosystem.config.js

Usage

Extract has a simple, file-based system for creating users and secret keys. This allows users to be added or removed while the system is running. In the ./users directory, each filename is a username and the file's contents are that user's secret key. To make a new user, run the following:

cd extract
mkdir users

# use your own secret key and username
echo "SECRET_KEY" > users/USERNAME
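If you would rather generate a random secret key than pick one by hand, the same step can be sketched in Ruby. This is a minimal sketch, not part of Extract: the `create_user` helper and the 64-character hex key are assumptions, since the only stated requirement is that the file's contents are the secret key.

```ruby
require "fileutils"
require "securerandom"

# Hypothetical helper (not part of Extract): creates <users_dir>/<username>
# holding a freshly generated secret key and returns that key.
def create_user(username, users_dir: "users")
  FileUtils.mkdir_p(users_dir)
  # 32 random bytes as 64 hex characters; the length is an arbitrary choice.
  secret = SecureRandom.hex(32)
  # The trailing newline mirrors what `echo` writes in the shell example.
  File.write(File.join(users_dir, username), "#{secret}\n")
  secret
end

# Example: create_user("USERNAME") writes users/USERNAME and returns the key.
```

Run it from the extract directory so that the default `users` path resolves to ./users.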

Once a username and secret key have been created, you can make a request.

A request has the following form:

http://localhost:3000/parser/:username/:signature?base64_url=:base64_url

The parts that you need are:

  • username: your username
  • signature: the hexadecimal HMAC-SHA1 signature of the URL you want to parse, computed with your secret key
  • base64_url: the base64-encoded version of the URL you want to parse

The URL is base64-encoded to avoid any issues in the way different systems encode URLs. It must use the RFC 4648 url-safe variant with no newlines.

If your platform does not offer a URL-safe base64 option, you can replicate it. First create the base64-encoded string. Then replace the following characters:

  • + => -
  • / => _
  • \n => ""
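The replacements above can be sketched in Ruby as follows. The `manual_urlsafe_base64` helper is a hypothetical name used for illustration; it starts from standard base64, which inserts line breaks, and should produce the same string as Ruby's built-in URL-safe encoder.

```ruby
require "base64"

# Hypothetical helper: derives URL-safe base64 from standard base64 by
# applying the three replacements listed above.
def manual_urlsafe_base64(string)
  standard = Base64.encode64(string) # standard alphabet, with line breaks
  standard.tr("+/", "-_").delete("\n")
end

url = "https://feedbin.com/blog/2018/09/11/private-by-default/"
puts manual_urlsafe_base64(url)
# Compare against the built-in URL-safe encoder as a sanity check.
puts manual_urlsafe_base64(url) == Base64.urlsafe_encode64(url)
```

Note that the padding characters (=) are kept, matching the example request URL in this README.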

Here's a sample implementation in Ruby. You can use it as a reference when checking your own implementation.

require "uri"
require "openssl"
require "base64"

username = "username"
secret = "secret"
host = "localhost"
port = 3000
url = "https://feedbin.com/blog/2018/09/11/private-by-default/"

# Sign the raw URL (not the base64 version) with the user's secret key.
digest = OpenSSL::Digest.new("sha1")
signature = OpenSSL::HMAC.hexdigest(digest, secret, url)

# urlsafe_encode64 does not insert newlines; the gsub is only a safeguard.
base64_url = Base64.urlsafe_encode64(url).gsub("\n", "")

# Use URI::HTTP instead if the server is not served over TLS.
URI::HTTPS.build({
  host: host,
  port: port,
  path: "/parser/#{username}/#{signature}",
  query: "base64_url=#{base64_url}"
}).to_s

The above example would produce:

https://localhost:3000/parser/username/e4696f8630bb68c21d77a9629ce8d063d8e5f81c?base64_url=aHR0cHM6Ly9mZWVkYmluLmNvbS9ibG9nLzIwMTgvMDkvMTEvcHJpdmF0ZS1ieS1kZWZhdWx0Lw==
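To check a URL produced by another implementation, one option is to take it apart again and recompute the signature. The sketch below assumes the path layout shown in this README (/parser/:username/:signature); `signature_matches?` is a hypothetical helper for client-side self-checks, not the server's actual verification code.

```ruby
require "uri"
require "openssl"
require "base64"

# Hypothetical helper: returns true when the signature in request_url equals
# the HMAC-SHA1 of the decoded base64_url parameter under the given secret.
def signature_matches?(request_url, secret)
  uri = URI.parse(request_url)
  # Path looks like "/parser/<username>/<signature>".
  _empty, _parser, _username, signature = uri.path.split("/")
  encoded = URI.decode_www_form(uri.query).to_h.fetch("base64_url")
  url = Base64.urlsafe_decode64(encoded)
  expected = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("sha1"), secret, url)
  expected == signature
end
```

Calling `signature_matches?(generated_url, secret)` on the output of your own implementation should return true when the signature and encoding agree with the scheme described here.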

A request to that URL returns:

{
    "title": "Private by Default",
    "author": null,
    "date_published": "2018-09-11T00:00:00.000Z",
    "dek": null,
    "lead_image_url": "https://assets.feedbin.com/assets-site/blog/2018-09-11/embed-3f43088538ae5ed7e585c00013adc13a915fd35de31990b3081a085b963ed7dd.png",
    "content": "<div>content</div>",
    "next_page_url": null,
    "url": "https://feedbin.com/blog/2018/09/11/private-by-default/",
    "domain": "feedbin.com",
    "excerpt": "September 11, 2018 by Ben Ubois I want Feedbin to be the opposite of Big Social. I think people should have the right not to be tracked on the Internet and Feedbin can help facilitate that. Since&hellip;",
    "word_count": 787,
    "direction": "ltr",
    "total_pages": 1,
    "rendered_pages": 1
}
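
A response like the one above can be fetched and read with Ruby's standard library. This is a rough sketch: `fetch_article` and `summarize` are hypothetical helper names, and treating any non-2xx status as a failure is an assumption, since this README does not describe error responses.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: requests a signed parser URL and returns the parsed JSON.
def fetch_article(request_url)
  response = Net::HTTP.get_response(URI(request_url))
  # Assumption: anything other than a 2xx status is treated as a failure.
  raise "request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body)
end

# Hypothetical helper: builds a one-line summary from a parsed response.
def summarize(article)
  title = article.fetch("title")
  domain = article.fetch("domain")
  words = article.fetch("word_count")
  "#{title} (#{domain}, #{words} words)"
end
```

With a running server, `puts summarize(fetch_article(request_url))` prints the title, domain, and word count of the parsed page.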