In this repository we provide sentiment analysis using a supervised machine learning method. In a previous project, we applied VADER (Valence Aware Dictionary for sEntiment Reasoning), a sentiment intensity analyser implemented in NLTK, to our unlabeled Amazon reviews dataset and obtained a performance score of 71% (please refer to my article on VADER). Our main goal is to achieve better performance in predicting positive and negative reviews.
As for the VADER classifier, we used a labeled dataset consisting of 10,000 reviews of Amazon products:
| | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |
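Loading and inspecting the data can be sketched as below. The rows here are stand-ins in the same shape as the table above (the negative examples are invented for illustration; the real dataset has 10,000 labeled reviews), and in practice the DataFrame would come from the dataset file rather than being built inline.

```python
import pandas as pd

# stand-in rows mirroring the (label, review) shape of the real dataset
df = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': [
        "Amazing!: This soundtrack is my favorite music...",
        "Disappointed: stopped working after a week...",
        "Excellent Soundtrack: I truly like this soundt...",
        "Do not buy: complete waste of money...",
    ],
})

# exploratory check: are positive and negative reviews balanced?
print(df['label'].value_counts())
```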
- Exploring: our exploratory analysis showed that the dataset is balanced between positive and negative reviews.
- Cleaning and prepping: dealing with empty records and splitting the data into train and test sets.
- Feature extraction: using TF-IDF (term frequency-inverse document frequency) to measure the relevance of words in the reviews.
- Training and testing: training and testing the SVM model with scikit-learn.
- Visualizing the performance results: using Matplotlib and Seaborn to show the classification report and the confusion matrix, comparing our classification results with a gold standard (manual labels).
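The feature-extraction and training steps above can be sketched as follows. This is a minimal, self-contained example using a tiny toy corpus in place of the real 10,000-review dataset; the vectorizer settings and the choice of `LinearSVC` are illustrative assumptions, not necessarily the exact configuration used in the repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# toy corpus standing in for the review dataset
reviews = [
    "Amazing! This soundtrack is my favorite music",
    "Terrible product, do not buy",
    "Excellent soundtrack, truly great",
    "Worst purchase ever, complete waste",
] * 25
labels = ['pos', 'neg', 'pos', 'neg'] * 25

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42)

# TF-IDF turns each review into a weighted bag-of-words vector
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)   # reuse the fitted vocabulary

# train a linear SVM on the TF-IDF features
my_model = LinearSVC()
my_model.fit(X_train_tfidf, y_train)
print(my_model.score(X_test_tfidf, y_test))
```

Note that the vectorizer is fitted on the training split only and then applied to the test split, so no vocabulary or document-frequency information leaks from the test data.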
Our model achieved a score of 87%. Its struggle with identifying negative reviews could be due to sarcastic comments; this could be the subject of further analysis.
```python
# Visualizing the classification report
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report

predictions = my_model.predict(X_test)
report = classification_report(y_test, predictions, output_dict=True)
df_report = pd.DataFrame(report).transpose().round(2)

# shade the report with a green gradient
cm = sns.light_palette("green", as_cmap=True)
df_report.style.background_gradient(cmap=cm)
```
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| neg | 0.86 | 0.89 | 0.87 | 1649 |
| pos | 0.89 | 0.85 | 0.87 | 1651 |
| accuracy | 0.87 | 0.87 | 0.87 | 0.87 |
| macro avg | 0.87 | 0.87 | 0.87 | 3300 |
| weighted avg | 0.87 | 0.87 | 0.87 | 3300 |
```python
# Visualizing the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

predictions = my_model.predict(X_test)
cm = confusion_matrix(y_test, predictions)

ax = plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax, cmap='Greens')

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['neg', 'pos'])
ax.yaxis.set_ticklabels(['neg', 'pos'])
```
```shell
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
```