- LSTM model for calculating semantic similarity, using the Keras library and the Quora dataset
- Sentiment analysis
- Negation identification
- Keyword extraction, including multi-word noun chunks, with stop-word removal
- Question type classification for questions that start with a question word such as what, when, how much, etc.
- Named entity detection using NLTK's tree2conlltags and spaCy
- Extract numbers
- WordNet similarity, with a dedicated "mini similarity check" function for the special case where the two sentences differ by only one word
- Text normalization and part-of-speech tagging
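
The one-word-difference special case in the list above can be illustrated with a minimal sketch. The function name `one_word_difference` is hypothetical and is not taken from the project; it only detects the case (same number of words, exactly one position differing) and returns the differing pair, which could then be passed to a WordNet-based comparison. How the project actually detects and scores this case is not shown here.

```python
from typing import Optional, Tuple


def one_word_difference(s1: str, s2: str) -> Optional[Tuple[str, str]]:
    """Return the differing word pair if the two sentences have the same
    number of words and differ in exactly one position, else None.

    Hypothetical helper for illustration; not the project's own function.
    """
    w1 = s1.lower().split()
    w2 = s2.lower().split()
    if len(w1) != len(w2):
        return None
    # Collect every position where the two sentences disagree.
    diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
    return diffs[0] if len(diffs) == 1 else None


pair = one_word_difference(
    "how much does the shirt cost",
    "how much does the shirt weigh",
)
print(pair)
```

A whitespace split is the simplest possible tokenisation; a real pipeline would likely run this check after text normalization.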
Two English text strings
- Flag indicating whether the two sentences are similar or not: `sim`
- Total similarity score: `sim_per`
- Keyword similarity score: `keywords_sim`
- Semantic similarity score: `keras`
- Keywords: `keywords`
- Maximum number of keywords: `max_keywords`
- Named entities: `entities`
- Sentiment scores: `sentiment`
- Numbers: `numbers`
- Question types: `class`, and their flag: `f_class`
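
For readers who want to type-check code that consumes the result, the output fields can be written down as a `TypedDict`. This is only a sketch: the name `SemsimResult` is hypothetical, and the value types are inferred from the sample outputs in this README, not from the library's source. The element types of `entities` and `numbers` are left open because both are empty in the samples.

```python
from typing import List, TypedDict

# Functional syntax is required because "class" is a Python keyword.
# Types are inferred from the README's sample outputs (an assumption).
SemsimResult = TypedDict(
    "SemsimResult",
    {
        "sim": int,                   # 1 if similar, 0 otherwise
        "sim_per": float,             # total similarity score
        "keywords_sim": float,        # keyword similarity score
        "keras": float,               # semantic similarity score
        "keywords": List[List[str]],  # one keyword list per sentence
        "max_keywords": int,
        "entities": List[list],       # one list per sentence
        "sentiment": List[float],
        "numbers": List[list],        # one list per sentence
        "class": List[int],           # one question type per sentence
        "f_class": bool,              # flag for the question types
    },
)
```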
from semsim import Semsim
model = Semsim()
q1 = "what is the cost of the shirt"
q2 = "how much does the shirt cost"
model.similar(q1, q2)
- similar
{'numbers': [[], []], 'keywords': [['cost', 'shirt'], ['cost', 'shirt']], 'max_keywords': 4, 'f_class': True, 'entities': [[], []], 'sentiment': [0.0, 0.0, 0.0], 'sim': 1, 'class': [5, 5], 'keras': 98.558107145608687, 'sim_per': 86.779053572804344, 'keywords_sim': 75.0}
q1 = "what is the cost of the shirt"
q2 = "how much does the shirt weigh"
model.similar(q1, q2)
- not similar
{'keywords': [['shirt', 'cost'], ['shirt', 'weigh']], 'sim_per': 45.730595498683286, 'max_keywords': 5, 'keywords_sim': 40.0, 'f_class': True, 'class': [5, 5], 'sentiment': [0.0, 0.0, 0.0], 'numbers': [[], []], 'entities': [[], []], 'keras': 51.461190997366565, 'sim': 0}
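
In both sample outputs above, `sim_per` equals the arithmetic mean of `keras` and `keywords_sim`. The sketch below checks that arithmetic on the two samples. Whether the library always combines the scores this way is an assumption drawn from these two results only, and `mean_of_scores` is a hypothetical helper, not part of the package.

```python
def mean_of_scores(result: dict) -> float:
    """Average the semantic (keras) and keyword similarity scores."""
    return (result["keras"] + result["keywords_sim"]) / 2


# Score fields copied from the two sample outputs in this README.
similar = {
    "keras": 98.558107145608687,
    "keywords_sim": 75.0,
    "sim_per": 86.779053572804344,
}
not_similar = {
    "keras": 51.461190997366565,
    "keywords_sim": 40.0,
    "sim_per": 45.730595498683286,
}

for result in (similar, not_similar):
    # The reported total matches the mean up to floating-point error.
    assert abs(mean_of_scores(result) - result["sim_per"]) < 1e-9
```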