A HackerEarth’s ML Challenge based on Detecting sentiments of image quote using unsupervised Machine Learning techniques and knowledge of OCR and NLP to build a model that analyzes sentiments of a quote and classifies them into positive
, negative
, or random
.
- Used glob to extract all the paths for the images
- In one approach i used cv2 and pytesseract library (images needs to be preprocessed using cv2 library) to read the text from the .jpg files.
- And in other i used easyocr library (gave better results) to read the text from the .jpg files.
- Removed \n and contraction then applied some spelling correction and created Sentiment scores using
SentimentIntensityAnalyzer
&TextBlob Sentiment polarity score
- Removed Non alphabets, Stopwords and applied lemmatization
- Created TFIDF vectors for the text
- Created vector represenation for each comment
- Stacked all these features i.e.
Sentiment scores
,TFIDF Vectors
,Comment Vectors
And used these as features to cluster the comments in 3 groups usingKmeans
.- Further by inspection detected the comment types in the cluster and mapped
Random
,Negative
,Positive
accordingly.