**Near Duplicates Chain Finder**
The Near Duplicates Chain Finder is a Java program that finds chains of near-duplicate documents. It runs on Java platform 1.7 and can be used on Windows, Mac, UNIX, Linux, etc. It is an addition to the Near Duplicates Finder, which searches for near-duplicate documents based on the internal text of a document. The Near Duplicates Finder works with different types of documents, including Plain Text, HTML, XML, PDF, Microsoft Office, OpenOffice, RTF, etc. A chain is an ordered collection of documents that starts from a root document and is sorted by document differences. The last document in a chain can be quite different from the first one; however, the software lets you see the whole chain of changes in one set.
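To make the idea of a chain concrete, here is a minimal sketch of one way such a chain could be assembled. It is not the Chain Finder's actual algorithm, which this text does not describe. The class `ChainSketch`, the word-set Jaccard measure in `similarity`, the greedy strategy in `buildChain`, and the `0.5` threshold are all assumptions made for illustration. The code avoids features newer than Java 1.7.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Chain Finder's real implementation.
class ChainSketch {

    // Jaccard similarity of the lower-cased word sets of two texts (an assumed measure).
    static double similarity(String a, String b) {
        Set<String> wordsA = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<String>(wordsA);
        union.addAll(wordsB);
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return union.isEmpty() ? 0.0 : (double) wordsA.size() / union.size();
    }

    // Starting at the root, repeatedly append the unvisited document that is most
    // similar to the current tail; stop when no candidate reaches the threshold.
    static List<String> buildChain(Map<String, String> docs, String root, double threshold) {
        List<String> chain = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        String tail = root;
        while (tail != null) {
            chain.add(tail);
            visited.add(tail);
            String best = null;
            double bestScore = -1.0;
            for (Map.Entry<String, String> entry : docs.entrySet()) {
                if (visited.contains(entry.getKey())) {
                    continue;
                }
                double score = similarity(docs.get(tail), entry.getValue());
                if (score > bestScore) {
                    best = entry.getKey();
                    bestScore = score;
                }
            }
            tail = (best != null && bestScore >= threshold) ? best : null;
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<String, String>();
        docs.put("v1", "the quick brown fox jumps over the lazy dog");
        docs.put("v2", "the quick brown fox leaps over the lazy dog");
        docs.put("v3", "the quick red fox leaps over the sleepy dog");
        docs.put("other", "completely unrelated text about cooking pasta");
        System.out.println(buildChain(docs, "v1", 0.5)); // prints [v1, v2, v3]
    }
}
```

In this toy data set, `v1` and `v3` share too few words to meet the threshold directly, yet both end up in the same chain because each is close to `v2`. That is the sense in which the last document of a chain can differ noticeably from the root.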
The Chain Finder uses data produced by the Near Duplicates Cluster Finder. The chains are built based on the text similarity of the processed documents. The report shows two known issues:
1. Project Gutenberg documents have a large license text added to each document. If the content is small in relation to the license, the software reports false near-duplicates because of the matching license text in each document.
2. Chinese encoding in text files is not processed properly, so the Chinese text is discarded. This again produces false near-duplicates because of the long license text.
The two largest clusters in the report display this problem.
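The license issue can be reproduced with a small sketch. It assumes a word-set Jaccard measure, which may differ from the measure the software really uses. The class `LicenseEffectSketch`, the helper `fakeLicense`, and the 200-word placeholder are hypothetical; the placeholder stands in for a long shared license and is not the real Project Gutenberg text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of how shared boilerplate inflates similarity.
class LicenseEffectSketch {

    // Jaccard similarity of the lower-cased word sets of two texts (an assumed measure).
    static double similarity(String a, String b) {
        Set<String> wordsA = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<String>(wordsA);
        union.addAll(wordsB);
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return union.isEmpty() ? 0.0 : (double) wordsA.size() / union.size();
    }

    // Builds a placeholder "license" made of the given number of distinct words.
    static String fakeLicense(int words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append(" licenseword").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String license = fakeLicense(200);
        String contentA = "roses are red";
        String contentB = "stars shine bright";

        // The contents alone share no words.
        System.out.println(similarity(contentA, contentB));
        // With the same long license appended, the two documents look nearly identical.
        System.out.println(similarity(contentA + license, contentB + license));
    }
}
```

The same effect explains the second issue: if the Chinese text is discarded, the license is almost all that remains of each document, so unrelated files compare as near-duplicates.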