Web Plucker

webpluck scrapes a specific values from a web page. It works as a standalone binary as well as in a API mode.

Download

Latest Releases

webpluck is available for 64 bit linux, OS X and Windows systems. Latest versions can be downloaded from the Release tab above. This is the preferred way.

Build from source

This is a golang project. Assuming you have golang compiler installed, the following will build the binary from scratch

$ git clone https://github.com/codeexpress/webpluck
$ cd webpluck
$ go build -o webpluck cmd/main.go

Usage

webpluck takes the following input:

URL of the webpage
XPATH of the element
optional regex to further narrow the selection

and outputs the selected value.

1. URL of the webpage

This is the link of the webpage which has the desired information we want to extract.

For example, if we want to scrape the founders of the StackOverflow website from its company page, the URL is: https://stackoverflow.com/company. The desired value we want to extract is: Joel Spolsky and Jeff Atwood

2. XPATH of the element

This is the link of the xpath of the element on the page that contains the required information. A good way to get this information is to (on a chrome browser):

Right click on the place were the information is present
Click "Inspect" to open the Chrome developer tools window with the element highligted
On the highlighed value in the HTML source code, Right click -> Copy -> Copy xpath
The copied value is the xpath we need

Get xpath Step 1	Get xpath Step 2

The xpath in example above comes out to be: //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()

3. `regex` to pluck the right value

Note the the xpath above leads us to the value: Joel Spolsky and Jeff Atwood launch Stack Overflow Since we want to trim that down further, we'll provide a regex value to extract just the names. This regex will fetch just the names (the value in parenthesis): ^(*.) launch .*

Sample hosted invocation

webpluck can be run as a standalone binary. To extract the names using the three params we just obtained, copy the targets.yml file and populate it with the parameters. The resulting targets.yml should look like this:

targetList:
  -   name: stackoverflow_founders
      baseUrl: https://stackoverflow.com/company
      xpath:  //*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()
      regex: ^(.*) launch .*

Now invoke webpluck as follows and obtain the answer:

$ ./webpluck_osx -f /path/to/targets.yml
{
  "stackoverflow_founders": "Joel Spolsky and Jeff Atwood"
}

Sample API invocation

webpluck can be run in server mode as well. Thereafter, clients written in other programming languages can scrape web pages using the webpluck API over the network.

To run webpluck in server mode listening on localhost on 8080:

$ ./webpluck -p 8080

An instance of webpluck API is running at https://api.code.express/webpluck/. You can use that for your light extraction needs. If your load is heavy, consider spinning your own server running webpluck

Armed with the knowledge of baseUrl, xpath and regex, we can now call the API endpoint by POSTing these three params: Example curl invocation for the server mode:

curl 'https://api.code.express/webpluck/' \
      --data-urlencode 'baseUrl=https://stackoverflow.com/company' \
      --data-urlencode 'xpath=//*[@id="content"]/section[3]/ol/li[1]/ol/li[2]/text()' \
      --data-urlencode 'regex=^(.*) launch .*' -g

The result from the API is as follows. The pluckedData field returns the value extracted:

{
  "baseUrl": "https://stackoverflow.com/company",
  "pluckedData": "Joel Spolsky and Jeff Atwood",
  "regex": "^(.*) launch .*",
  "xpath": "//*[@id=\"content\"]/section[3]/ol/li[1]/ol/li[2]/text()"
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
cmd		cmd
logger		logger
webpluck		webpluck
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum
targets.yml		targets.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Web Plucker

Download

Latest Releases

Build from source

Usage

1. URL of the webpage

2. XPATH of the element

3. `regex` to pluck the right value

Sample hosted invocation

Sample API invocation

About

Releases 2

Packages

Languages

License

codeexpress/webpluck

Folders and files

Latest commit

History

Repository files navigation

Web Plucker

Download

Latest Releases

Build from source

Usage

1. URL of the webpage

2. XPATH of the element

3. regex to pluck the right value

Sample hosted invocation

Sample API invocation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

3. `regex` to pluck the right value

Packages