
abdullahalzubaer/Pairwise-Sample-Similarity


Pairwise-Sample-Similarity

Pairwise sample similarity (cosine) between records.

It is sometimes necessary to know how similar the records in a database are to one another. This repository contains a program that computes the pairwise cosine similarity between records.
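To make the idea concrete, here is a minimal sketch of pairwise cosine similarity using only the Python standard library. It is not the code from this repository: the function names `cosine_similarity` and `pairwise_similarity` and the toy `records` are assumptions made for illustration, and a dataset of realistic size would likely call for a vectorized library instead of nested loops.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector has no direction, so report 0.0 similarity.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(records):
    """Return an n x n symmetric matrix of cosine similarities."""
    n = len(records)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only compute the upper triangle (including the diagonal),
        # then mirror it, since cosine similarity is symmetric.
        for j in range(i, n):
            s = cosine_similarity(records[i], records[j])
            sim[i][j] = s
            sim[j][i] = s
    return sim


# Hypothetical toy records (not taken from the Covertype dataset).
records = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same direction as record 0
    [3.0, 0.0, -1.0],  # orthogonal to record 0
]
matrix = pairwise_similarity(records)
for row in matrix:
    print([round(value, 3) for value in row])
```

Cosine similarity compares direction only, so two records that differ by a constant scale factor (records 0 and 1 above) score as maximally similar.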

For example, if data arrives in the same database from different sources, we might want to automate the check of how similar the samples are. Very similar records may be unwanted, and it might be necessary to delete duplicate (or near-duplicate) data.
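A hedged sketch of how near-duplicates could be flagged is shown here. The helper `find_near_duplicates`, the `threshold` value of 0.99, and the sample records are all hypothetical choices for illustration, not part of this repository; a suitable threshold depends on the data.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat a zero vector as similar to nothing.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(records, threshold=0.99):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    pairs = []
    # combinations() visits each unordered pair of records exactly once.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        s = cosine_similarity(a, b)
        if s >= threshold:
            pairs.append((i, j, s))
    return pairs


# Hypothetical records: the first two are almost identical.
records = [
    [10.0, 20.0, 30.0],
    [10.0, 20.0, 31.0],
    [30.0, -5.0, 1.0],
]
pairs = find_near_duplicates(records, threshold=0.99)
for i, j, s in pairs:
    print(f"records {i} and {j} look like near-duplicates (cosine = {s:.4f})")
```

Because cosine similarity ignores magnitude, a pair flagged this way may still differ in scale, so it is worth inspecting candidates before deleting anything.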

The dataset used here already contains numerical values, so no encoding step (for example, from text to numerical values) was needed. Handling data that requires encoding is future work :)

TODO

  • Provide proper documentation
  • Describe the dataset characteristics
  • Try different similarity measures
  • Work with text data: encode it, then compute the similarity

Dataset:

https://archive.ics.uci.edu/ml/datasets/covertype
