CCMF

Creative Commons Media-Fingerprint Library

CCMF is an easy-to-use JavaScript library, for the browser and for Node.js, that helps copyright owners track their content across the web.



##Overview

###Philosophy

CCMF is intended to safeguard copyright owners' intellectual property by keeping track of where their content appears on the web. Eventually, CCMF will notify owners when the copyright on their content has been violated.

To make this possible, CCMF requires a signature of the author's original content. Various search channels then compare web content against this signature. When a search channel comes across significantly similar content, CCMF verifies that it matches the original content and identifies its author.

Another key idea is that all computation can be done in the client's browser or in a Node.js environment. The library does not require a centralised server.

###Requirements

To register and submit content, users have to register an account with Creative Commons.

###Installation

####Client Browser

Add the ccmf library to your page:

<script src='https://raw.github.com/ethanlim/ccmf/master/lib/build/ccmf.js'></script>

####Node Module

Add ccmf to the dependencies in package.json and run npm install:

"dependencies": {
  "ccmf":"git://github.com/ethanlim/ccmf.git#master"
}

###Quick Start

Create a text module object:

var textMod = ccmf.Text.create();

Insert your credentials into a metadata object:

var metadata = {
	author: {
		first: 'test',
		last: 'test',
		email: 'test@test.com'
	}
};

####Register

Call the text module's register method to register any text content in the Creative Commons database:

textMod.register(
	registeringText,	// text content to be registered
	{k:9},				// shingle length: more on this below
	metadata,			// the metadata object constructed above
	storeCallback		// callback for any additional actions once the text is stored
);

An example register callback (define it before calling register):

var storeCallback = function(error){
	if(error === null){
		jQuery('#result').text("Text registered with creative commons");
	}
	else{
		console.log(error);
	}
};

####Search

Call the search method to look for registered text that is similar to yours.

textMod.search(
	textToBeSearched,	// text to search with
	{k:9},				// shingle length: more on this below
	null,				// reserved for future use
	resultCallback		// callback invoked once the results are ready
);

The search callback is slightly different, as it receives the results of your search:

var resultCallback = function(results){

	// If there are any results
	if(results.count > 0){

		var resultSets = results['sets'],
			metadata = null,
			author = null,
			set = null;

		for(var result = 0; result < results.count; result++){

			set = JSON.parse(resultSets[result]);	// signature set of the similar text

			metadata = set['metadata'];				// the metadata object (same shape as above)

			author = metadata['author'];			// the author's details

			console.log('Signature :' + set['sig'].toString().substring(0,30) + ' Author : ' + author['first']);
		}
	}
	else{
		console.log('No Similar Signature Found');
	}
};

##Text Module

###General

The search and register methods above rely on three components of the text module: shingle extraction, minhashing and locality-sensitive hashing (LSH). Together they reduce a piece of text to signatures (patterns of integers) that preserve how similar the text is to other texts. The process has three steps: convert the text into shingles, minhash the shingles, and finally apply LSH. The end product is a signature that can be stored efficiently and recognised as similar to the signature of another text.

[Image: text registration process]

####Shingles

Extracting shingles means extracting substrings from a given text. With ccmf's API, shingles can be extracted in three different ways:

  • Fixed Shingles

    The most basic form of shingle extraction. Extract every substring of length k, from the beginning to the end of the text.

     var textAShingles = textMod.fixedShinglesWithoutWS(rawText,k);
  • Remove Stop Words Shingles

    Remove all stop words first, then perform Fixed Shingles extraction.

     var textAShingles = textMod.removedStopWordShingles(rawText,k);
  • Stop More After Stop Word Shingles

    A different extraction method. Each shingle is formed from the two words that follow a stop word.

     var textAShingles = textMod.stopMoreShingles(rawText,k);
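To make the Fixed Shingles idea concrete, here is a minimal standalone sketch. The `fixedShingles` helper is hypothetical and is not part of ccmf; in particular, collapsing runs of whitespace to a single space and dropping duplicate shingles are assumptions, since the library's own handling is not shown in this README.

```javascript
// Hypothetical helper, not ccmf's implementation: collect every distinct
// substring of length k from the text.
function fixedShingles(text, k) {
  // Assumption: runs of whitespace are collapsed to a single space.
  var normalised = text.replace(/\s+/g, ' ').trim();
  var seen = {};
  var shingles = [];
  for (var i = 0; i + k <= normalised.length; i++) {
    var shingle = normalised.substring(i, i + k);
    // Keep each shingle once, since a shingle set has no duplicates.
    if (!Object.prototype.hasOwnProperty.call(seen, shingle)) {
      seen[shingle] = true;
      shingles.push(shingle);
    }
  }
  return shingles;
}

console.log(fixedShingles('abcdab', 2)); // [ 'ab', 'bc', 'cd', 'da' ]
```

A larger k makes shingles more distinctive but less tolerant of small edits, which is why the shingle length is exposed as a parameter.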

A set of extracted shingles generally occupies more space than the text itself. Hence, we should shrink it by hashing the shingles into an array of integers.

var shinglesFingerprintA = textMod.shinglesFingerprintConv(textAShingles);
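As an illustration of turning shingles into integers, the sketch below hashes each shingle to an unsigned 32-bit value. The choice of FNV-1a (applied to UTF-16 code units) and the names `fnv1a32` and `shinglesToFingerprint` are assumptions made for this example; the hash that `shinglesFingerprintConv` actually uses is not documented here.

```javascript
// Hypothetical helper: FNV-1a over the string's UTF-16 code units,
// returning an unsigned 32-bit integer.
function fnv1a32(str) {
  var hash = 0x811c9dc5;                 // FNV offset basis
  for (var i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);  // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0;
}

// Hypothetical helper: map an array of shingles to an array of integers.
function shinglesToFingerprint(shingles) {
  return shingles.map(function (shingle) {
    return fnv1a32(shingle);
  });
}

console.log(shinglesToFingerprint(['ab', 'bc', 'cd']));
```

Each shingle of any length is reduced to a single integer, at the cost of a small chance that two different shingles collide.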

####MinHash

Minhash is a technique for compressing the amount of data needed for comparison while preserving the inherent similarity between the texts.

var signatures = [];
signatures[0] = shinglesFingerprintA;

The compressed integer array from the previous step is loaded into an array of signatures. Use this array if you would like to perform similar-text matching solely in the browser. You can add any number of fingerprints to it.

signatures[1] = shinglesFingerprintB;

signatures[2] = shinglesFingerprintC;

signatures[3] = shinglesFingerprintD;

Now generate the minhash signatures (the input can hold fingerprints from one or more texts):

var minHashSignatures = this.minHashSignaturesGen(signatures);
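The following is a minimal sketch of the general minhash technique, not of `minHashSignaturesGen` itself. The helpers `mix32`, `minHashSignature` and `estimateSimilarity`, the seeding scheme and the use of a 32-bit mixing function as the hash family are all assumptions chosen for illustration. For each of `numHashes` hash functions it keeps the smallest hash value over the fingerprint; the fraction of positions where two signatures agree then estimates the Jaccard similarity of the underlying shingle sets.

```javascript
// Hypothetical 32-bit mixer (an invertible integer scrambler).
function mix32(x) {
  x ^= x >>> 16;
  x = Math.imul(x, 0x85ebca6b);
  x ^= x >>> 13;
  x = Math.imul(x, 0xc2b2ae35);
  x ^= x >>> 16;
  return x >>> 0;
}

// Hypothetical helper: one minimum per seeded hash function.
function minHashSignature(fingerprint, numHashes) {
  var signature = [];
  for (var i = 0; i < numHashes; i++) {
    var seed = Math.imul(i + 1, 0x9e3779b1);  // assumed seeding scheme
    var min = 0xffffffff;
    for (var j = 0; j < fingerprint.length; j++) {
      var h = mix32(fingerprint[j] ^ seed);
      if (h < min) { min = h; }
    }
    signature.push(min);
  }
  return signature;
}

// Fraction of matching positions, an estimate of Jaccard similarity.
function estimateSimilarity(sigA, sigB) {
  var matches = 0;
  for (var i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) { matches++; }
  }
  return matches / sigA.length;
}

var sigA = minHashSignature([1, 2, 3, 4, 5, 6, 7, 8], 100);
var sigB = minHashSignature([1, 2, 3, 4, 5, 6, 7, 9], 100);
console.log(estimateSimilarity(sigA, sigB)); // estimate of the true Jaccard similarity 7/9
```

More hash functions give a tighter estimate at the cost of a longer signature.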

####Locality-Sensitive Hashing (LSH)

Comparing every pair of minhash signatures to find the most similar pair would be too inefficient. For this use case we only need to focus on the pairs that are most likely to be similar, not on every pair. The search function uses locality-sensitive hashing (LSH) under the hood. The idea behind LSH is that, by hashing a minhash signature several times, similar texts eventually hash to the same location.

LSH belongs to the data module, so we first have to create the data module object.

var dataMod = ccmf.Data.create();

Next, create the callback that processes the returned data.

var callback = function(snapshot){
	/* Search through each band */
	if(snapshot.val() !== null){
		var foundSignatureSet = snapshot.val();
	}
};

Call the data module's method to conduct LSH.

dataMod.conductLsh(minHashSignatures,callback);

The callback is then invoked with the similar minhash signatures.
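The banding idea behind LSH can be sketched locally as follows. This is an in-memory illustration under stated assumptions, not what `conductLsh` does (which appears to query a remote store): the `lshCandidatePairs` helper is hypothetical, and using the joined band values directly as the bucket key stands in for a real band hash. Each signature is cut into `bands` equal slices, and two signatures become a candidate pair as soon as any one slice matches exactly.

```javascript
// Hypothetical helper: return candidate pairs as "i-j" index strings.
function lshCandidatePairs(signatures, bands) {
  var rows = Math.floor(signatures[0].length / bands);  // rows per band
  var candidates = {};
  for (var b = 0; b < bands; b++) {
    var buckets = {};
    for (var s = 0; s < signatures.length; s++) {
      // Assumption: the band's values joined as a string act as the bucket key.
      var key = signatures[s].slice(b * rows, (b + 1) * rows).join(',');
      if (!buckets[key]) { buckets[key] = []; }
      buckets[key].push(s);
    }
    // Every pair of signatures sharing a bucket in this band is a candidate.
    Object.keys(buckets).forEach(function (key) {
      var ids = buckets[key];
      for (var i = 0; i < ids.length; i++) {
        for (var j = i + 1; j < ids.length; j++) {
          candidates[ids[i] + '-' + ids[j]] = true;
        }
      }
    });
  }
  return Object.keys(candidates);
}

var demoSignatures = [
  [1, 2, 3, 4],
  [1, 2, 9, 9],
  [7, 7, 7, 7]
];
console.log(lshCandidatePairs(demoSignatures, 2)); // [ '0-1' ]
```

Only the candidate pairs then need a full signature comparison, which is what avoids comparing every pair.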

##Feature Request and Bug Fixes

Submit all feature requests and bug reports here.

##Versioning

To provide maximum backward compatibility, CCMF adopts the Semantic Versioning v2.0.0 guidelines.

Releases will be numbered with the following format:

<major>.<minor>.<patch>

e.g. v1.2.12 represents the 1st major, 2nd minor and 12th patch release.

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions that do not break backward compatibility bump the minor (and reset the patch)
  • Bug fixes and miscellaneous changes bump the patch
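The three rules above can be expressed as a small function. `bumpVersion` is a hypothetical helper written for this illustration and is not part of ccmf; it assumes a plain `<major>.<minor>.<patch>` string with an optional leading `v`.

```javascript
// Hypothetical helper: apply one of the three bump rules to a version string.
function bumpVersion(version, change) {
  var parts = version.replace(/^v/, '').split('.').map(Number);
  if (change === 'major') {
    return [parts[0] + 1, 0, 0].join('.');         // reset minor and patch
  }
  if (change === 'minor') {
    return [parts[0], parts[1] + 1, 0].join('.');  // reset patch
  }
  return [parts[0], parts[1], parts[2] + 1].join('.');
}

console.log(bumpVersion('v1.2.12', 'major')); // 2.0.0
console.log(bumpVersion('v1.2.12', 'minor')); // 1.3.0
console.log(bumpVersion('v1.2.12', 'patch')); // 1.2.13
```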

##Authors

Ethan Lim

##Miscellaneous

###Copyright & License

The MIT License (MIT)

Copyright (c) 2013 Lim Zhi Hao

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

###Theoretical Reading