Pairwise sample similarity (cosine) between records.
It is sometimes necessary to know how similar the records in a database are to one another. This repository contains a program that computes the pairwise cosine similarity between records.
For example, when data arrive in the same database from different sources, we may want to automate the check of how similar the samples are. Near-identical records may be unwanted, and it might be necessary to delete duplicate (or close to duplicate) data.
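As an illustration of the idea, here is a minimal sketch of pairwise cosine similarity using NumPy. It is not necessarily how the program in this repository is written; the function names `pairwise_cosine_similarity` and `near_duplicate_pairs` and the `threshold` value of 0.99 are assumptions chosen for the example.

```python
import numpy as np


def pairwise_cosine_similarity(records):
    """Return an (n, n) matrix of cosine similarities between rows."""
    X = np.asarray(records, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: an all-zero record gets similarity 0 to everything,
    # so replace zero norms with 1 to avoid dividing by zero.
    norms[norms == 0.0] = 1.0
    unit = X / norms
    return unit @ unit.T


def near_duplicate_pairs(records, threshold=0.99):
    """List (i, j, similarity) for record pairs at or above the threshold."""
    sim = pairwise_cosine_similarity(records)
    # Only look above the diagonal so each pair is reported once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    mask = sim[rows, cols] >= threshold
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(rows[mask], cols[mask])]


if __name__ == "__main__":
    # Hypothetical numeric records; row 1 is a scaled copy of row 0.
    data = [[1, 2, 3], [2, 4, 6], [3, 0, -1]]
    print(pairwise_cosine_similarity(data))
    print(near_duplicate_pairs(data))
```

Note that cosine similarity ignores magnitude, so a record and a scaled copy of it count as duplicates here. Whether that is desirable depends on the data.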
The dataset used here already contains numerical values, so no encoding step (for example, from text to numerical values) was needed. Future work:
- Provide proper documentation
- Describe the dataset characteristics
- Try different similarity measures
- Work with text data and encode and then find the similarity