Parser for Microdata and JSON-LD from HTML documents


Microdata

Microdata is a package to extract Microdata and JSON-LD from HTML documents.

HTML Microdata is a markup specification often used together with the Schema.org vocabulary to help search engines identify and understand the content of web pages. One of the most common schemas is the rating shown next to search results. Other schemas describe persons, places, events, products, and so on.
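To make the markup concrete, here is a minimal sketch of a Microdata snippet and a toy reader for its text properties. The sample markup and the `textProps` helper are assumptions made for illustration. This is not how the package parses pages: `encoding/xml` is used only because the sample is well-formed and the standard library has no HTML parser, and nested items are ignored.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleHTML is a hypothetical, well-formed Microdata snippet.
const sampleHTML = `<div itemscope="" itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
</div>`

// textProps collects the text content of every element that carries an
// itemprop attribute. It is a toy: it does not handle nested items,
// attribute-valued properties (href, content, ...), or real-world HTML.
func textProps(markup string) (map[string]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	props := make(map[string]string)
	current := "" // itemprop name of the element being read, if any
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return props, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			current = ""
			for _, a := range t.Attr {
				if a.Name.Local == "itemprop" {
					current = a.Value
				}
			}
		case xml.CharData:
			if current != "" {
				props[current] += string(t)
			}
		case xml.EndElement:
			current = ""
		}
	}
}

func main() {
	props, err := textProps(sampleHTML)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(props["name"], "/", props["jobTitle"])
}
```

The `itemscope` attribute marks the start of an item, `itemtype` names its schema, and each `itemprop` names one property of that item.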

JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale.
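As an illustration, a JSON-LD block is ordinary JSON with a few reserved keys such as `@context` and `@type`, usually embedded in a `<script type="application/ld+json">` element. The sketch below reads one with the standard library only. The sample document and the `jsonLDType` helper are assumptions made for this example and are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleJSONLD is a hypothetical JSON-LD document.
const sampleJSONLD = `{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Borscht",
  "recipeYield": "6 servings"
}`

// jsonLDType decodes a JSON-LD object and returns its "@type" and "name".
// It handles only the simple case where both values are plain strings.
func jsonLDType(doc string) (typ, name string, err error) {
	var obj map[string]interface{}
	if err = json.Unmarshal([]byte(doc), &obj); err != nil {
		return "", "", err
	}
	typ, _ = obj["@type"].(string)
	name, _ = obj["name"].(string)
	return typ, name, nil
}

func main() {
	typ, name, err := jsonLDType(sampleJSONLD)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s: %s\n", typ, name) // prints "Recipe: Borscht"
}
```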

Go package use

Install the package:

go get -u github.com/astappiev/microdata

Use cases:

// Pass a URL to the `ParseURL` function.
data, err := microdata.ParseURL("https://example.com/page")

// Pass an `io.Reader`, content-type and a base URL to the `ParseHTML` function.
data, err := microdata.ParseHTML(reader, contentType, baseURL)

// Pass an `html.Node`, content-type and a base URL to the `ParseNode` function.
data, err := microdata.ParseNode(node, contentType, baseURL)

An example program:

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/astappiev/microdata"
)

func main() {
	data, err := microdata.ParseURL("https://www.allrecipes.com/recipe/84450/ukrainian-red-borscht-soup/")
	if err != nil {
		log.Fatalf("parse URL: %v", err)
	}

	// Iterate over the extracted items and print each item's types and properties.
	for _, item := range data.Items {
		fmt.Println(item.Types)
		for key, prop := range item.Properties {
			fmt.Printf("%s: %v\n", key, prop)
		}
	}

	// Print the extracted data as indented JSON.
	out, err := json.MarshalIndent(data, "", "  ")
	if err != nil {
		log.Fatalf("marshal JSON: %v", err)
	}
	fmt.Println(string(out))
}

Command line use

Install the command line tool:

go install github.com/astappiev/microdata/cmd/microdata@latest

Parse a URL:

microdata https://www.gog.com/game/...
{
  "items": [
    {
      "type": [
        "http://schema.org/Product"
      ],
      "properties": {
        "additionalProperty": [
          {
            "type": [
              "http://schema.org/PropertyValue"
            ],
...

Parse HTML from stdin:

cat saved.html | microdata

Format the output with a Go template to return the "price" property:

microdata -format '{{with index .Items 0}}{{with index .Properties "offers" 0}}{{with index .Properties "price" 0 }}{{ . }}{{end}}{{end}}{{end}}' https://www.gog.com/game/...
8.99
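The nested `with index` calls walk from the first item, to its first "offers" value, to that offer's first "price" value. The sketch below runs the same template with Go's `text/template` against stand-in data to show how it resolves. The `Item` and `Data` types, the `render` and `sampleData` helpers, and the values are assumptions made for this example; only the field names `Items` and `Properties` come from the template itself.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// Item and Data are hypothetical stand-ins for the parsed result.
type Item struct {
	Types      []string
	Properties map[string][]interface{}
}

type Data struct {
	Items []*Item
}

// priceTemplate is the template from the command above.
const priceTemplate = `{{with index .Items 0}}{{with index .Properties "offers" 0}}{{with index .Properties "price" 0 }}{{ . }}{{end}}{{end}}{{end}}`

// sampleData builds a product with one nested offer (made-up values).
func sampleData() Data {
	offer := &Item{
		Types:      []string{"http://schema.org/Offer"},
		Properties: map[string][]interface{}{"price": {"8.99"}},
	}
	product := &Item{
		Types:      []string{"http://schema.org/Product"},
		Properties: map[string][]interface{}{"offers": {offer}},
	}
	return Data{Items: []*Item{product}}
}

// render parses tmplText and executes it against d.
func render(tmplText string, d Data) (string, error) {
	tmpl, err := template.New("format").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(priceTemplate, sampleData())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out) // prints "8.99"
}
```

In this sketch, `index` returns an error when a position is out of range, so a template like this assumes the page has at least one item with at least one offer.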
