<a href="https://colab.research.google.com/github/d-maniatakos/vaccine-sentiment-classifier/blob/master/vaccine_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Module Imports

In [230]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Data Import

In [200]:
# read datasets and ignore the first (index) column
train_data = pd.read_csv('vaccine_train_set.csv').iloc[:, 1:]
validation_data = pd.read_csv('vaccine_validation_set.csv').iloc[:, 1:]

## Study Data

In [201]:
train_data['label'].value_counts()

0    7458
2    6445
1    2073
Name: label, dtype: int64

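Most tweets in the training set express a neutral opinion on vaccines (label 0, 7458 tweets), somewhat fewer express a positive opinion (label 2, 6445 tweets), and the fewest express a negative one (label 1, 2073 tweets). The classes are therefore imbalanced, with negative tweets clearly under-represented.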
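
To put numbers on the imbalance, the counts above can be converted into class shares. The following is a minimal sketch in plain Python. The counts are copied from the `value_counts()` output, and the label names (0 neutral, 1 negative, 2 positive) are an assumption inferred from the sentence above, not from any dataset documentation.

```python
# Label counts copied from the value_counts() output above.
label_counts = {0: 7458, 2: 6445, 1: 2073}

# Assumed label meanings, inferred from the surrounding text.
label_names = {0: "neutral", 1: "negative", 2: "positive"}

total = sum(label_counts.values())

# Share of each class in the training set.
proportions = {label_names[label]: count / total
               for label, count in label_counts.items()}

for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
```

Roughly 47% of the tweets are neutral, 40% positive and 13% negative, which is worth keeping in mind when reading averaged metrics later on.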

In [202]:
# concatenate the tweets of each class (0: neutral, 1: negative, 2: positive)
neutral_tweets = ' '.join(train_data[train_data.label == 0]['tweet'].tolist())
negative_tweets = ' '.join(train_data[train_data.label == 1]['tweet'].tolist())
positive_tweets = ' '.join(train_data[train_data.label == 2]['tweet'].tolist())

## Preprocessing

In [203]:
# def preprocess(text):
#   text = text.replace('@', '')
#   text = text.replace('#', '')
#   lemmatizer = WordNetLemmatizer()
#   lemmatized_text = ''
#   for token in text.split():
#     lemmatized_text += lemmatizer.lemmatize(token) + ' '

#   return lemmatized_text

# train_data['tweet'] = train_data['tweet'].apply(preprocess)
# validation_data['tweet'] = validation_data['tweet'].apply(preprocess)

## Vectorization

In [233]:
def create_vectorizer(train_corpus, method='tf-idf', max_features=1000, ngram_range=(1, 2)):
  # build a TF-IDF or bag-of-words vectorizer and fit it on the training corpus only
  if method == 'tf-idf':
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=ngram_range)
  elif method == 'bow':
    vectorizer = CountVectorizer(max_features=max_features, ngram_range=ngram_range)
  else:
    raise ValueError("method must be 'tf-idf' or 'bow', got " + repr(method))
  return vectorizer.fit(train_corpus)

def vectorize(vectorizer, corpus):
  return vectorizer.transform(corpus)

## Model Creation & Training

In [237]:
train_corpus = train_data['tweet'].tolist()
validation_corpus = validation_data['tweet'].tolist()


vectorizer = create_vectorizer(train_corpus)
train_vector = vectorize(vectorizer, train_corpus)
validation_vector = vectorize(vectorizer, validation_corpus)

train_features = pd.DataFrame(data=train_vector.toarray(), columns = vectorizer.get_feature_names_out())
train_labels = train_data['label'].values

classifier = LogisticRegression(multi_class="multinomial")
classifier.fit(train_features.to_numpy(), train_labels)

predicted_labels = classifier.predict(validation_vector.toarray())
precision = precision_score(validation_data['label'].tolist(), predicted_labels, average='weighted')
recall = recall_score(validation_data['label'].tolist(), predicted_labels, average='weighted')
f1 = f1_score(validation_data['label'].tolist(), predicted_labels, average='weighted')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
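
The warning above means the solver stopped at its iteration cap (`max_iter`, which defaults to 100 in `LogisticRegression`) before converging. One common remedy, sketched below on synthetic data, is to raise `max_iter`. The synthetic dataset and the value 1000 are assumptions for illustration only, not settings taken from this notebook. The `multi_class` argument is omitted here on the assumption that a recent scikit-learn version already uses a multinomial formulation for multi-class data with the default solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the vectorized tweets (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Allow more iterations than the default of 100 (1000 is an illustrative value).
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning no longer appears. Whether this changes the validation scores of the tweet classifier would have to be checked by rerunning the cell above.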


In [236]:
print('Precision: ' + str(precision))
print('Recall: ' + str(recall))
print('F1-Score: ' + str(f1))

Precision: 0.5976358182752302
Recall: 0.5438212094653813
F1-Score: 0.47762350485514354
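
Because the scores above use `average='weighted'`, each class contributes in proportion to its number of samples, so a weak minority class can be hidden in the average. The sketch below uses hypothetical labels (not the notebook's predictions) to show how `average=None` exposes per-class scores.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 1 is the minority and is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 2, 0, 2, 2, 2, 2, 0]

# One F1 score per class, in label order 0, 1, 2.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

# Support-weighted mean of the per-class scores.
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(per_class)
print(weighted)
```

In this toy case the weighted F1 is about 0.53 even though class 1 scores 0, so it is worth inspecting the per-class values alongside the weighted ones.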


In [219]:
my_text = "go vaccinate"
my_vector = vectorize(vectorizer, [my_text])

classifier.predict(my_vector)



array([2])
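
A single predicted label hides how confident the model is. As a rough sketch, `predict_proba` returns one probability per class. The example below trains a tiny stand-in model on hypothetical tweets so that it runs on its own. The tweets, the pipeline and the label names (0 neutral, 1 negative, 2 positive) are assumptions for illustration, not the notebook's trained classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training tweets and labels, for illustration only.
toy_tweets = [
    "vaccine appointments open on monday",
    "the clinic lists vaccine opening hours",
    "i will never vaccinate my kids",
    "vaccines are dangerous and rushed",
    "go vaccinate it protects everyone",
    "so happy i got my vaccine today",
]
toy_labels = [0, 0, 1, 1, 2, 2]

# Assumed label meanings, inferred from the class counts discussed earlier.
label_names = {0: "neutral", 1: "negative", 2: "positive"}

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(toy_tweets, toy_labels)

# One probability per class, in the order given by model.classes_.
probs = model.predict_proba(["go vaccinate"])[0]
for label, p in zip(model.classes_, probs):
    print(label_names[label], round(float(p), 3))
```

With the notebook's own objects, the equivalent call would be `classifier.predict_proba(my_vector)`.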