This is a small piece of code that demonstrates feature extraction and how near-duplicate documents show up in SimHash similarity scores.
The seminal reference that brings everything together is Detecting Near-Duplicates for Web Crawling by Manku et al.
This code builds on another project that implements the SimHash algorithm as described in the Manku paper. A minimal sketch of the construction follows.
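The core idea from the paper is easy to sketch: hash each feature to 64 bits, let each feature vote +1/-1 on every bit position, and keep the sign of each total. The snippet below is only an illustration of that construction, not the bundled project's actual API; the md5-based feature hash is a stand-in.

```python
import hashlib

def simhash(features, bits=64):
    """Combine feature hashes into a single SimHash fingerprint."""
    v = [0] * bits
    for feature in features:
        # Hash each feature to a `bits`-wide integer (md5 here is illustrative).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # Each feature votes +1 if bit i is set, -1 otherwise.
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps only the sign of each position's total.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```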
To use the code, you'll need:
- NLTK
- simhash from the project mentioned above.
- The code is written in Python, so preferably run it on Linux.
- Place whatever text files you want to compare in the corpus directory. Every file will be read and compared pairwise against every other file to produce a similarity score (a rough sketch of this loop appears after this list).
- The file to run is hashtest.py.
- The result table is written to results.csv.
- To interpret the results: values closer to 0 indicate that the two files are similar, while larger values indicate greater differences.
- Removing stopwords and stemming before hashing sharpens the distinction between documents (see the preprocessing sketch after this list).
- Hashing on n-grams (shingles) makes the comparison stricter: even small changes produce large distances (see the shingling sketch after this list).
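For a sense of what the comparison loop looks like, here is a rough sketch of pairwise scoring over the corpus directory, reusing the simhash() and hamming_distance() helpers from the sketch near the top of this README; the real hashtest.py may structure this differently.

```python
import csv
import itertools
import os

corpus_dir = "corpus"
fingerprints = {}
for name in sorted(os.listdir(corpus_dir)):
    with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
        # Whitespace tokens as features; see the preprocessing sketch below.
        fingerprints[name] = simhash(f.read().split())

with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file_a", "file_b", "hamming_distance"])
    # Compare every file against every other file exactly once.
    for a, b in itertools.combinations(sorted(fingerprints), 2):
        writer.writerow([a, b, hamming_distance(fingerprints[a], fingerprints[b])])
```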
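One way to do the stopword removal and stemming mentioned above is with NLTK's English stopword list and Porter stemmer; the exact pipeline used in this repo may differ.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stopword lists

def preprocess(text):
    """Tokenize, drop English stopwords, and stem the remaining tokens."""
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]
```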
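Shingling is just sliding a window of n tokens over the text and treating each window as one feature, which is why small edits disturb many features at once; the choice of n below is illustrative, not a value prescribed by this repo.

```python
def shingles(tokens, n=3):
    """Return overlapping n-token shingles, e.g. word trigrams for n=3."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example: shingles(["the", "quick", "brown", "fox"], n=2)
# -> ["the quick", "quick brown", "brown fox"]
```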