
Document Formats

WinnowTag edited this page Sep 14, 2010 · 1 revision

Two document formats are important to the classifier. The first, the Tag Definition Document (TDD), defines the manual examples for a tag; these make up the positive and negative training data used by the classifier. When a job is created in the classifier, the client provides a URL to the TDD, and the classifier fetches the TDD in order to build a Tagger for classifying items. The second, the Classifier Tagging Document (CTD), lists the items to which the classifier has applied a given tag and the strength with which the tag was applied. The CTD is created by the classifier as the output of the classification process. It is sent to the classifier tagging URL defined in the TDD that was used to create the Tagger that generated the CTD.

Both of these formats are based on the Atom Syndication Format, with some minor extension elements.

Tag Definition Document

The TDD is an Atom Feed document where the items in the feed represent the training items for a tag. All the semantics of the Atom Syndication Format apply along with some extra conditions and elements. The extensions and conditions will be illustrated by example:

Extension elements specific to the classifier are defined within the classifier’s namespace, which is http://peerworks.org/classifier. The namespace can be declared on the root element of the feed like so:


<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:classifier="http://peerworks.org/classifier">
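
As a minimal sketch of how a consumer might resolve this declaration, the following Python uses the standard library's xml.etree.ElementTree. The sample feed and the NS mapping name are assumptions made for illustration; only the two namespace URIs come from the declaration above, and this is not the classifier's own code.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. The prefixes are local to this script; only the URIs matter.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "classifier": "http://peerworks.org/classifier",
}

# Hypothetical minimal feed, just enough to show namespace resolution.
SAMPLE = """<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:classifier="http://peerworks.org/classifier">
  <classifier:bias>1.2</classifier:bias>
</feed>"""

root = ET.fromstring(SAMPLE)

# Elements are matched by namespace URI, not by the prefix used in the document.
bias = root.find("classifier:bias", NS)

print(root.tag)             # fully qualified name of the Atom feed element
print(bias.tag, bias.text)  # fully qualified name and text of the extension element
```

Because matching is done on the namespace URI, a TDD that binds the same URI to a different prefix would be read identically by this sketch.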

Next, some additional metadata is required at the feed level.

This is an example of the feed level elements for a TDD:


  <id>http://trunk.mindloom.org:80/seangeo/tags/a-religion</id>
  <title>seangeo:a-religion</title>
  <updated>2008-06-24T18:37:44+09:30</updated>
  <link href="http://trunk.mindloom.org:80/seangeo/tags/a-religion.atom" rel="alternate"/>
  <link href="http://trunk.mindloom.org:80/seangeo/tags/a-religion/training.atom" rel="self"/>

  <!-- The following four elements are required by the classifier -->
  <link href="http://trunk.mindloom.org:80/seangeo/tags/a-religion/classifier_taggings.atom"
             rel="http://peerworks.org/classifier/edit"/>
  <category scheme="http://trunk.mindloom.org:80/seangeo/tags/" term="a-religion"/>
  <classifier:bias>1.2</classifier:bias>
  <classifier:classified>2008-06-25T00:47:36+09:30</classifier:classified>

The required elements serve the following purposes:

  • The required link element specifies the URL to which the Classifier Tagging Document should be sent when classification is complete. The link is identified by the value http://peerworks.org/classifier/edit in its rel attribute.
  • The required category element specifies the scheme and term used for all category elements relating to the tag in both TDD and CTD documents. Positive items in the TDD and classifier-generated taggings in the CTD both use this scheme and term.
  • The classifier:bias element defines the classification bias as a float. Values less than 1.0 make the classifier more conservative, whereas values greater than 1.0 make it less conservative.
  • The classifier:classified element provides the time at which the tag was last classified.

Although only four of these elements are required by the classifier, all of them should be provided to ensure compliance with the Atom spec.
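
The four required elements described above could be read and checked roughly as follows. This is a sketch under stated assumptions rather than the classifier's actual implementation: the function name read_required_elements, the returned dictionary keys, and the trimmed sample feed are all hypothetical, and only the element names, the rel value, and the sample values come from the example above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "classifier": "http://peerworks.org/classifier",
}
EDIT_REL = "http://peerworks.org/classifier/edit"


def read_required_elements(feed_xml):
    """Return the four classifier-required feed-level values as a dict.

    Hypothetical helper: raises ValueError if any required element is missing.
    """
    root = ET.fromstring(feed_xml)

    # The tagging URL is the atom:link whose rel is the classifier edit relation.
    tagging_url = None
    for link in root.findall("atom:link", NS):
        if link.get("rel") == EDIT_REL:
            tagging_url = link.get("href")
            break

    # find() only looks at direct children, so entry-level categories are ignored.
    category = root.find("atom:category", NS)
    bias = root.find("classifier:bias", NS)
    classified = root.find("classifier:classified", NS)

    required = [
        ("classifier edit link", tagging_url),
        ("category", category),
        ("classifier:bias", bias),
        ("classifier:classified", classified),
    ]
    missing = [name for name, value in required if value is None]
    if missing:
        raise ValueError("TDD is missing: " + ", ".join(missing))

    return {
        "tagging_url": tagging_url,
        "scheme": category.get("scheme"),
        "term": category.get("term"),
        "bias": float(bias.text),
        "classified": datetime.fromisoformat(classified.text.strip()),
    }


# Trimmed copy of the feed-level example, containing only the required elements.
SAMPLE = """<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:classifier="http://peerworks.org/classifier">
  <link href="http://trunk.mindloom.org:80/seangeo/tags/a-religion/classifier_taggings.atom"
        rel="http://peerworks.org/classifier/edit"/>
  <category scheme="http://trunk.mindloom.org:80/seangeo/tags/" term="a-religion"/>
  <classifier:bias>1.2</classifier:bias>
  <classifier:classified>2008-06-25T00:47:36+09:30</classifier:classified>
</feed>"""

meta = read_required_elements(SAMPLE)
print(meta["tagging_url"])
print(meta["term"], meta["bias"], meta["classified"].isoformat())
```

Collecting every missing element before raising, rather than failing on the first one, gives the author of a TDD a single complete error message to act on.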

Following the feed level metadata are the items that make up the training data for the tag. These are specified as atom:entry elements using the full content of the items. The classifier specifies two conditions to identify whether an item is a positive or negative example.

TODO

Classifier Tagging Document

TODO
