Standalone (batch- and command-line) and Gradle-plugin html sanity checker - detects missing images, dead links and cross-references, duplicate link targets (anchors) and the like.

Html Sanity Check (Html-SC)

This project provides some basic sanity checking on html files.

It can be helpful in case of html generated from e.g. Asciidoctor, Markdown or other formats - as converters usually don’t check for missing images or broken links.

It can be used as a Gradle plugin. A standalone Java version and a graphical UI are planned for future releases.


Installation

Use the following snippet inside a Gradle build file:

build.gradle
plugins {
  id 'org.aim42.htmlSanityCheck' version '1.0.0-RC-1'
}

OR

build.gradle
buildscript {
    repositories {
        maven {
            url "https://plugins.gradle.org/m2/"
        }
    }

    dependencies {
        classpath ('gradle.plugin.org.aim42:htmlSanityCheck:1.0.0-RC-1')
    }
}

apply plugin: 'org.aim42.htmlSanityCheck'

Usage

The plugin adds a new task named htmlSanityCheck.

This task exposes a few properties as part of its configuration:

sourceDir

(mandatory) directory where the html files are located. Type: File. Default: build/docs.

sourceDocuments

(optional) an override to process several source files, which may be a subset of all files available in ${sourceDir}. Type: Set. Defaults to all files in ${sourceDir}.

checkingResultsDir

(optional) directory where the checking results are written to. Defaults to ${buildDir}/report/htmlchecks/

junitResultsDir

(optional) directory where the results are written to in JUnit XML format. JUnit XML can be read by many tools, including CI environments. Defaults to ${buildDir}/test-results/htmlchecks/

failOnErrors

(optional) if set to "true", the build will fail if any error was found in the checked pages. Defaults to false
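
As a rough sketch, the properties listed above might be combined like this, including junitResultsDir so that a CI server can pick up the results. The paths are simply the documented defaults; that junitResultsDir takes a File (like checkingResultsDir in the examples) is an assumption, as the type is not stated here.

```groovy
htmlSanityCheck {
    // directory containing the generated HTML (documented default)
    sourceDir = new File( "$buildDir/docs" )

    // human-readable checking results (documented default location)
    checkingResultsDir = new File( "$buildDir/report/htmlchecks" )

    // JUnit XML results, e.g. for a CI server
    // (documented default location; the File type is assumed)
    junitResultsDir = new File( "$buildDir/test-results/htmlchecks" )

    // let the build fail if any error was found
    failOnErrors = true
}
```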

Examples

build.gradle (small example)
apply plugin: 'org.aim42.htmlSanityCheck'

htmlSanityCheck {
    sourceDir = new File( "$buildDir/docs" )

    // files to check - in Set-notation
    sourceDocuments = [ "one-file.html", "another-file.html", "index.html"]

    // where to put results of sanityChecks...
    checkingResultsDir = new File( "$buildDir/report/htmlchecks" )

    // fail build on errors?
    failOnErrors = true
}

build.gradle (extensive example)
buildscript {
    repositories {
        maven {
            url "https://plugins.gradle.org/m2/"
        }
        jcenter()
    }
}


plugins {
    id 'org.aim42.htmlSanityCheck' version '1.0.0-RC-1'
    id 'org.asciidoctor.convert' version '1.5.8'
}


// ==== path definitions =====
// ===========================

// location of AsciiDoc files
def asciidocSrcPath = "$projectDir/src/asciidoc"

// location of images used in AsciiDoc documentation
def srcImagesPath = "$asciidocSrcPath/images"

// results of asciidoc compilation (HTML)
// (input for htmlSanityCheck)
// this is the default path for asciidoc-gradle-convert
def htmlOutputPath = "$buildDir/asciidoc/html5"

// images used by generated html
def targetImagesPath = htmlOutputPath + "/images"

// where HTMLSanityCheck checking results are stored
def checkingResultsPath = "$buildDir/report/htmlchecks"


apply plugin: 'org.asciidoctor.convert'

asciidoctor {
    sourceDir = new File( asciidocSrcPath )

    options backends: ['html5'],
            doctype: 'book',
            icons: 'font',
            sectlink: true,
            sectanchors: true

    resources {
        from( srcImagesPath )
        into targetImagesPath
    }


}

apply plugin: 'org.aim42.htmlSanityCheck'


htmlSanityCheck {

    // ensure asciidoctor->html runs first
    // and images are copied to build directory

    dependsOn asciidoctor

    sourceDir = new File( htmlOutputPath )

    // files to check, in Set-notation
    sourceDocuments = [ "many-errors.html", "no-errors.html"]

    // where to put results of sanityChecks...
    checkingResultsDir = new File( checkingResultsPath )

    // fail build on errors?
    failOnErrors = false
}

Typical Output

The overall goal is to create neat and clear reports that show any errors found within HTML files, as shown in the adjoining figure.

sample hsc report

Types of Sanity Checks

Broken Cross References

Finds all '<a href="XYZ">' links whose target XYZ is not defined.

src/broken.html
<a href="#missing">internal anchor</a>
...
<h2 id="missinG">Bookmark-Header</h2>

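In this example, the link points to #missing, but the header defines the id missinG. Because the spelling differs (ids are case-sensitive), the link target does not exist.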
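
The idea behind this check can be sketched in a few lines of Groovy. This is only an illustration, not the plugin's actual code: the plugin builds on Jsoup, while this sketch uses simple regular expressions and only handles double-quoted attributes. The method name findBrokenInternalLinks is hypothetical.

```groovy
// Hypothetical helper, not part of the plugin:
// collect every id defined in the document, then report each internal
// link (href="#...") whose target is not among the defined ids.
def findBrokenInternalLinks(String html) {
    def definedIds  = (html =~ /\sid="([^"]+)"/).collect { it[1] } as Set
    def linkTargets = (html =~ /\shref="#([^"]+)"/).collect { it[1] }
    return linkTargets.findAll { !(it in definedIds) }
}

def html = '''<a href="#missing">internal anchor</a>
<h2 id="missinG">Bookmark-Header</h2>'''

// the comparison is case-sensitive, so "missing" has no matching id
println findBrokenInternalLinks(html)
```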

Missing Image Files

Images, referenced in '<img src="XYZ"…​' tags, refer to external files. The existence of these files is checked by the plugin.
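
A minimal sketch of such an existence check, assuming image paths are resolved relative to the directory containing the HTML file. The method name findMissingImages and the decision to skip remote (http/https) and inline (data:) images are assumptions made for this illustration, not the plugin's documented behaviour.

```groovy
// Hypothetical helper, not part of the plugin:
// returns the src values of all local images that do not exist below baseDir.
def findMissingImages(String html, File baseDir) {
    // extract the src attribute of every <img ...> tag (double-quoted values only)
    def sources = (html =~ /<img\s[^>]*?src="([^"]+)"/).collect { it[1] }

    // assumption: remote and inline images are not checked here
    def localSources = sources.findAll { !(it ==~ /(https?:|data:).*/) }

    return localSources.findAll { !(new File(baseDir, it).exists()) }
}
```

It might be called as findMissingImages(new File(dir, "index.html").text, dir), where dir is the directory holding the generated HTML.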

Multiple Definitions of Bookmarks or IDs

If any bookmark or id is defined more than once, every link pointing to it becomes ambiguous :-)
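
One way to detect such duplicates is to count how often each id occurs. The following is a simplified sketch (regex-based, double-quoted attributes only) with a hypothetical method name, not the plugin's own implementation.

```groovy
// Hypothetical helper, not part of the plugin:
// returns every id value that is defined more than once.
def findDuplicateIds(String html) {
    def ids = (html =~ /\sid="([^"]+)"/).collect { it[1] }
    // countBy yields a map of id -> number of occurrences
    return ids.countBy { it }
              .findAll { id, count -> count > 1 }
              .keySet()
              .toList()
}
```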

Missing Local Resources

Checks that all local files (e.g. downloads) referenced from html actually exist.

Missing Alt-attributes in Images

Image-tags should contain an alt-attribute that the browser displays when the original image file cannot be found or cannot be rendered. Having alt-attributes is good and defensive style.

Broken External Links

The current version contains a somewhat naive implementation that sends a HEAD request to each external link, reads the HTTP response status and identifies errors (status >400) and warnings (status 1xx or 3xx).
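
The classification described above could look roughly like the following sketch. It is not the plugin's code: the method names, the timeout values and the plain string results are made up for illustration. The text above says "status >400"; the sketch assumes that status 400 itself also counts as an error.

```groovy
// Hypothetical helper: maps an HTTP status code to 'ok', 'warning' or 'error'.
def classify(int status) {
    // assumption: 400 itself is treated as an error, too
    if (status >= 400) return 'error'
    // informational (1xx) and redirect (3xx) responses are warnings
    if ((status in (100..199)) || (status in (300..399))) return 'warning'
    return 'ok'
}

// Hypothetical helper: sends a HEAD request and returns the response status.
// The 5 second timeouts are arbitrary values chosen for this sketch.
def headStatus(String url) {
    def connection = new URL(url).openConnection()
    connection.requestMethod = 'HEAD'
    connection.connectTimeout = 5000
    connection.readTimeout = 5000
    try {
        return connection.responseCode
    } finally {
        connection.disconnect()
    }
}
```

A single link could then be checked with classify(headStatus(someUrl)), where someUrl is any http or https address.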

Future plans include configurable ranges (as some people might want some content behind paywalls NOT to result in errors…​)

Localhost or numerical IP addresses are NOT marked as suspicious.

Please comment in case you have additional requirements.

planned: ftp, ntp or other protocols are currently not checked, but should…​

Technical Documentation

In addition to checking HTML, this project serves as an example for arc42.

Fundamentals

This tiny piece rests on incredible groundwork:

  • Jsoup HTML parser and analysis toolkit - robust and easy-to-use.

  • IntelliJ IDEA - my (Gernot) best (programming) friend.

  • Of course, Groovy, Gradle, JUnit and Spockframework.

Ideas and Origin

  • The plugin heavily relies on code provided by the Gradle project.

  • Inspiration on code organization, implementation and testing of the plugin came from the Asciidoctor-Gradle-Plugin by [@AAlmiray].

  • Code for string similarity calculation by Ralph Rice.

  • Initial implementation, maintenance and documentation by Gernot Starke.

Development

Several sources provided help during development.

Similar Projects

  • The gradle-linkchecker-plugin is an (open source) gradle plugin which validates that all links in a local HTML file tree go out to other existing local files or remote web locations. It creates a simple text file report and might be a complement to this HtmlSanityChecker.

  • Benjamin Muschko has created a (Go-based) command-line tool to check links, called link verifier.

Contributing

Please report issues or suggestions.

Want to improve the plugin? Fork our repository and send a pull request.

Licence

Currently code is published under the Apache-2.0 licence, documentation under Creative-Commons-Sharealike-4.0.

Some day I’ll unify that :-)

Big thanx to Structure-101 for helping us analyze and restructure our code…​
