# Comparison of Classifiers Used for Sentiment Analysis

We should not use classifiers for sentiment analysis in a production setting. Nonetheless, text-based sentiment analysis with classifiers is still a fun exercise because it is easy to understand. In this notebook, we compare the accuracy and run time of several scikit-learn classification models on the same Twitter dataset.

In [6]:
# Install the packages

!pip install pandas scikit-learn > /dev/null 2>&1

In [7]:
from collections import namedtuple
import time

import pandas as pd
import sklearn.model_selection as ms
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, accuracy_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [8]:
# Load and preprocess data.
trained_data = pd.read_csv('../../data/twitter_data.csv')

# Drop rows that are missing the text or the category label.
trained_data = trained_data.dropna(subset=['text', 'category'])

# Convert the text into bag-of-words count vectors.
vec = CountVectorizer()
X = vec.fit_transform(trained_data['text'])
y = trained_data['category']

# Split the data into 2 datasets for training and testing.
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.25, random_state=42)
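
The cell above fits `CountVectorizer` on every row before splitting, so the vocabulary is built partly from text that later lands in the test set. A minimal sketch of an alternative, assuming a scikit-learn pipeline is acceptable here: split the raw text first and let the pipeline fit the vectorizer on the training rows only. The toy `texts` and `labels` below are hypothetical stand-ins for the `text` and `category` columns, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 'text' and 'category' columns.
texts = [
    'i love this phone', 'what a great day', 'really happy with the service',
    'best movie ever', 'i hate waiting', 'this is terrible',
    'worst service ever', 'really sad and angry',
]
labels = ['positive'] * 4 + ['negative'] * 4

# Split the raw text first, so the test rows stay unseen during fitting.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only, then the model.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=3000))
pipeline.fit(text_train, label_train)

predictions = pipeline.predict(text_test)
print(f'Test rows: {len(text_test)}, predictions: {len(predictions)}')
```

With a bag-of-words vocabulary the effect on accuracy is probably small, but the pipeline keeps the evaluation honest and works unchanged with cross-validation helpers.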

In [9]:
ModelRun = namedtuple('ModelRun', ['name', 'model'])

classifiers = [
    ModelRun('K-Nearest Neighbors', KNeighborsClassifier(3)),
    ModelRun('Logistic Regression', LogisticRegression(max_iter=3000, class_weight='balanced', solver='lbfgs')),
    ModelRun('Multinomial Naive Bayes', MultinomialNB()),
    ModelRun('Random Forest', RandomForestClassifier(max_depth=8, random_state=42)),
    ModelRun('C-Support Vector', SVC(gamma='auto')),  # Slow: takes about 43 minutes on an M-series CPU. Comment this line out to skip it.
]


sum_time = 0.0
for clf in classifiers:
    print(f'Running {clf.name}')

    # Time training, prediction, and scoring together for each model.
    start_time = time.perf_counter()
    clf.model.fit(X_train, y_train)
    y_pred = clf.model.predict(X_test)
    print(f'Accuracy: {accuracy_score(y_test, y_pred) * 100:.3f}')

    elapsed_time = time.perf_counter() - start_time
    sum_time += elapsed_time
    print(f'Time to run: {elapsed_time:.1f}s')
    print()


print(f'Total time ran: {sum_time}s')

Running K-Nearest Neighbors
Accuracy: 49.643
Time to run: 63.7s

Running Logistic Regression
Accuracy: 94.222
Time to run: 7.2s

Running Multinomial Naive Bayes
Accuracy: 74.243
Time to run: 0.0s

Running Random Forest
Accuracy: 44.057
Time to run: 13.2s

Running C-Support Vector
Accuracy: 44.057
Time to run: 2592.4s

Total time ran: 2676.4916076660156s
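
Random Forest and C-Support Vector report exactly the same accuracy (44.057). One possible explanation, not verified here, is that both models predict a single class for every test row. A minimal sketch of how to check that, assuming scikit-learn's `DummyClassifier` as a majority-class baseline; the toy labels are hypothetical stand-ins for `y_train` and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train and y_test.
y_train_toy = np.array(['pos', 'pos', 'pos', 'neg', 'neg', 'neu'])
y_test_toy = np.array(['pos', 'neg', 'pos', 'neu', 'pos'])

# DummyClassifier ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline_pred)
print(f'Baseline accuracy: {baseline_accuracy * 100:.3f}')

# A model whose predictions contain one distinct label has collapsed to a constant.
print(f'Distinct predicted labels: {np.unique(baseline_pred)}')
```

In this notebook the same check would use `X_train`, `y_train`, `X_test`, and `y_test`, plus `np.unique(y_pred)` for each model. If the baseline also lands near 44.057, those two models are doing no better than guessing the majority class, and settings such as `max_depth=8` or `gamma='auto'` would be the first things to revisit.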
