Calculating Document Similarity with Named Entity Recognition, TF-IDF, and Cosine Similarity (Python).

albertpratomo/DocumentSimilarity

Document Similarity with NER, TF-IDF, and Cosine Similarity

A step-by-step guide to calculating the similarity between multiple documents in Python.

We tokenize our texts using Named Entity Recognition (NER), so that a multi-word named entity such as "Los Angeles" is extracted as one token instead of two. This improves comparison accuracy.
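As a rough illustration of entity-aware tokenization, the sketch below keeps multi-word names together as single tokens. It does not run a real NER model: the `KNOWN_ENTITIES` list is a hypothetical gazetteer standing in for the entities an NER library would detect, and `tokenize_with_entities` is an illustrative name, not a function from this repository.

```python
import re

# Hypothetical gazetteer standing in for the output of a real NER model.
KNOWN_ENTITIES = ["Los Angeles", "New York"]


def tokenize_with_entities(text, entities=KNOWN_ENTITIES):
    """Split text into lowercase tokens, keeping each listed entity as one token."""
    if not entities:
        # No entities supplied: fall back to plain word tokenization.
        pattern = re.compile(r"\w+")
    else:
        # Try longer entity names first so they win over shorter ones they contain.
        ordered = sorted(entities, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        # Match a whole entity if possible, otherwise a single word.
        pattern = re.compile(rf"\b(?:{alternatives})\b|\w+", re.IGNORECASE)
    return [match.group(0).lower() for match in pattern.finditer(text)]


tokens = tokenize_with_entities("I flew from Los Angeles to New York.")
print(tokens)
```

With a plain word tokenizer, "Los Angeles" would become the two unrelated tokens "los" and "angeles"; here it stays one term, so two documents mentioning the city share exactly one matching token.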

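TF-IDF (term frequency-inverse document frequency) is then calculated to create a vector model of each document based on the frequencies of its terms. Terms that appear in many documents receive a lower weight than terms that are specific to a few.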
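A minimal sketch of how such vectors could be built with only the standard library is shown below. TF-IDF has several common variants; this one assumes term frequency is the count divided by document length and inverse document frequency is `log(N / df)`, which may differ from the weighting used in the repository code. The function name `tfidf_vectors` and the sample documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per tokenized document."""
    # Fixed, sorted vocabulary so every vector uses the same term order.
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = [
    ["los angeles", "is", "sunny"],
    ["new york", "is", "rainy"],
]
vocabulary, vectors = tfidf_vectors(docs)
```

In this example "is" occurs in both documents, so its IDF is `log(2 / 2) = 0` and it contributes nothing to either vector, while "los angeles" and "new york" carry positive weight only in the document that mentions them.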

Cosine similarity measures the cosine of the angle between two vectors and uses it as their degree of similarity. In the context of document similarity, each document is represented by one vector, and we compare every pair of document vectors.
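For two vectors `a` and `b`, the cosine of the angle between them is their dot product divided by the product of their lengths. The sketch below implements that definition in plain Python and applies it to every pair of vectors; the names `cosine_similarity` and `similarity_matrix` are illustrative, and returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarity for a list of document vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two point in the same direction, the third is orthogonal to both.
example_vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
matrix = similarity_matrix(example_vectors)
```

Because TF-IDF weights are non-negative, similarities between document vectors fall between 0 (no shared weighted terms) and 1 (same direction), and the measure ignores differences in document length.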

Further documentation can be found in the code.

Albert Edwillian Pratomo
Martina Marcelline
