# Sentiment analysis with naive Bayes

The aim of this notebook is to predict the sentiment of tweets using a naive Bayes classifier. The tweets are in text form, and the possible categories for this classification task are *positive* and *negative*.

## Dataset

The dataset that we'll be using is the well-established *sentiment140* dataset, consisting of 1.6 million tweets extracted using the Twitter API. Half of the tweets are annotated as 'positive' and half as 'negative'. The annotation method detects tweets that contain certain emoticons, uses the corresponding emotion to categorize the tweet, and then removes the emoticon from the text.

The detailed approach can be found in the official paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
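
The labeling idea described above can be sketched in a few lines. This is only an illustration: the emoticon lists and the `distant_label` helper are assumptions made for this example, not the paper's exact rules. The labels 0 (negative) and 4 (positive) follow the encoding used in the dataset.

```python
# Illustrative emoticon lists (assumption; the paper's exact lists may differ)
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")

def distant_label(tweet):
    """Return (label, cleaned_text), or None if the emoticon signal is missing or ambiguous."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # neither kind of emoticon, or both kinds: no usable label
        return None
    label = 4 if has_pos else 0  # 4 = positive, 0 = negative
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")  # remove the emoticon so the model cannot just memorize it
    return label, " ".join(tweet.split())

print(distant_label("Finally got my exam results :)"))
print(distant_label("Missed the bus again :("))
print(distant_label("Just landed in Berlin"))
```

Because the label comes from the emoticon rather than from a human reader, sarcastic or mixed tweets can end up with the wrong label.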

## Importing Packages

In [2]:
# utility
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# nlp
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# machine learning
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import classification_report

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Importing and Preprocessing Data

Because the naive Bayes classifier doesn't take into account the surrounding context of words, it makes sense to remove as much noise from the text as possible in this step.
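
To illustrate what "no surrounding context" means, here is a minimal sketch using a plain single-word bag of words. The `bag_of_words` helper and the two example sentences are made up for this illustration. The bigram features used later in this notebook recover some local word order.

```python
from collections import Counter

def bag_of_words(text):
    # count each word, ignoring where it appears in the sentence
    return Counter(text.lower().split())

a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

# Both sentences contain exactly the same words, so with single-word
# features they look identical to the classifier.
print(a == b)
```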

In [3]:
# load dataset
df = pd.read_csv('datasets/sentiment140.csv', names=['sentiment', 'id', 'date', 'query', 'user', 'text'])
df = df.drop(['id', 'date', 'query', 'user'], axis=1)

# remove usernames
def remove_username(text):
    return ' '.join(word for word in text.split() if not word.startswith('@'))

# remove urls
def remove_url(text):
    # str.startswith accepts a tuple; 'http' also covers 'https'
    return ' '.join(word for word in text.split() if not word.startswith(('http', 'www')))

# remove non-alphabetic characters (keeps only lowercase a-z and spaces)
def remove_nonalphabet(text):
    allowed = set('abcdefghijklmnopqrstuvwxyz ')
    return ''.join(char for char in text if char in allowed)

# remove stopwords
STOP_WORDS = set(stopwords.words('english'))  # build the set once for fast lookups

def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# convert lowercase
def convert_lowercase(text):
    return text.lower()

# lemmatize
lemmatizer = WordNetLemmatizer()  # create once instead of once per word

def lemmatize(text):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())


def preprocess(text):
    text = convert_lowercase(text)
    text = remove_username(text)
    text = remove_url(text)
    text = lemmatize(text)
    text = remove_stopwords(text)
    text = remove_nonalphabet(text)
    return text

In [4]:
# apply preprocessing and save new dataset
df_preprocessed = df.copy()
df_preprocessed['text'] = df_preprocessed['text'].apply(preprocess)
df_preprocessed.to_csv('datasets/sentiment140_preprocessed.csv')

## Splitting into train and test set

In [5]:
X_train, X_test, y_train, y_test = train_test_split(df_preprocessed['text'], df_preprocessed['sentiment'], test_size=0.2, random_state=42)

## TF-IDF Vectorizer

We also need to vectorize our dataset so that our ML models can work with it.

We fit our vectorizer on the training data only.

In [6]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectorizer.fit(X_train)

We transform our train and test sets using the vectorizer we created.

In [7]:
# transform
X_train = vectorizer.transform(X_train)
X_test  = vectorizer.transform(X_test)

## Naive Bayes Classifier

In [8]:
# we are using a Bernoulli Naive Bayes model
clf = BernoulliNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.75      0.77    159494
           4       0.77      0.81      0.79    160506

    accuracy                           0.78    320000
   macro avg       0.78      0.78      0.78    320000
weighted avg       0.78      0.78      0.78    320000



* Training the naive Bayes classifier was extremely fast
* The metrics may not be the best, but an accuracy of 0.78 is not bad at all
* Especially when you consider the data we work with
    * The tweets were annotated automatically based on emoticons, which may not correlate with the text of the tweet
    * Classifying sentiment as either positive or negative is a very simplified model
    * It doesn't account for the strength of the sentiment
    * It doesn't account for neutral tweets
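
One detail worth keeping in mind when reading these numbers: scikit-learn's `BernoulliNB` binarizes its input (default `binarize=0.0`), so every positive TF-IDF weight is treated as "feature present". The sketch below, on made-up toy documents, checks that the model learns the same parameters from TF-IDF features as from plain presence/absence features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["good good day", "bad day", "really good", "really bad bad"]
labels = [4, 0, 4, 0]

X_tfidf = TfidfVectorizer().fit_transform(docs)              # weighted features
X_binary = CountVectorizer(binary=True).fit_transform(docs)  # 1 if the word occurs, else 0

nb_tfidf = BernoulliNB().fit(X_tfidf, labels)
nb_binary = BernoulliNB().fit(X_binary, labels)

# identical learned feature probabilities: the TF-IDF weights are not used
print(np.allclose(nb_tfidf.feature_log_prob_, nb_binary.feature_log_prob_))
```

If the weights themselves should matter, a model such as `MultinomialNB` may be worth trying; that comparison is not part of this notebook.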

## Test without preprocessing

Let's look at how our model performs without our preprocessing.

In [9]:
# Let us test this without preprocessing
X_train2, X_test2, y_train2, y_test2 = train_test_split(df['text'], df['sentiment'], test_size=0.2, random_state=42)
vectoriser2 = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser2.fit(X_train2)
X_train2 = vectoriser2.transform(X_train2)
X_test2 = vectoriser2.transform(X_test2)

In [10]:
clf2 = BernoulliNB().fit(X_train2, y_train2)
y_pred2 = clf2.predict(X_test2)
print(classification_report(y_test2, y_pred2))

              precision    recall  f1-score   support

           0       0.80      0.79      0.80    159494
           4       0.80      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000



Possible reasons why the model performs better without preprocessing:
* With an n-gram range of 1 to 2, the model looks at more than single words, so removing stop words can destroy informative word pairs
* Extensive removal of text discards information
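
The first point can be made concrete with a small sketch. The stop word set here is a hand-picked subset chosen for this example (NLTK's English list also contains negations such as "not" and "no"), and both helper functions are hypothetical.

```python
# small illustrative subset of stop words (assumption, not the full NLTK list)
STOP_WORDS = {"i", "am", "this", "is", "the", "not", "no", "was"}

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def bigrams(text):
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

tweet = "this is not good"

print(bigrams(tweet))                   # includes the pair "not good"
print(drop_stop_words(tweet))           # the negation is gone
print(bigrams(drop_stop_words(tweet)))  # no word pairs left at all
```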

Let us test this without removing stop words.

In [127]:
def preprocess_no_stopwords(text):
    text = convert_lowercase(text)
    text = remove_username(text)
    text = remove_url(text)
    text = lemmatize(text)
    text = remove_nonalphabet(text)
    return text

df = pd.read_csv('datasets/sentiment140.csv', names=['sentiment', 'id', 'date', 'query', 'user', 'text'])
df = df.drop(['id', 'date', 'query', 'user'], axis=1)

df_preprocessed2 = df.copy()
df_preprocessed2['text'] = df_preprocessed2['text'].apply(preprocess_no_stopwords)

In [128]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(df_preprocessed2['text'], df_preprocessed2['sentiment'], test_size=0.2, random_state=42)
vectoriser2 = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser2.fit(X_train2)
X_train2 = vectoriser2.transform(X_train2)
X_test2 = vectoriser2.transform(X_test2)
clf2 = BernoulliNB().fit(X_train2, y_train2)
y_pred2 = clf2.predict(X_test2)
print(classification_report(y_test2, y_pred2))

              precision    recall  f1-score   support

           0       0.81      0.78      0.79    159494
           4       0.79      0.82      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000



There is no meaningful difference compared to applying no preprocessing.
Conclusion: preprocessing is probably not necessary for this task.

Note: Try `TweetTokenizer` from the nltk package
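
A possible starting point for that experiment is sketched below, untested on the full dataset. The example tweets and the names `tweet_tokenizer` and `tweet_vectorizer` are assumptions for illustration. The tokenizer is passed to `TfidfVectorizer` through its `tokenizer` parameter.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# lowercases, drops @handles and shortens long character runs ("looooove")
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tokens = tweet_tokenizer.tokenize("@someuser I looooove this!!! #happy")
print(tokens)

# use the tweet-aware tokenizer instead of the default regex tokenizer
tweet_vectorizer = TfidfVectorizer(
    tokenizer=tweet_tokenizer.tokenize,
    token_pattern=None,  # unused when a tokenizer is given; avoids a warning
    lowercase=False,     # the tokenizer already lowercases
    ngram_range=(1, 2),
)
X_demo = tweet_vectorizer.fit_transform(
    ["@someuser I looooove this!!! #happy", "so sad today"]
)
print(X_demo.shape)
```

This would replace the hand-written username handling with the tokenizer's own rules; whether it improves the scores would still need to be measured.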

## Using official test data

A manually labeled test set exists for this dataset. Evaluating on it should give a more reliable picture of the model than the automatically labeled hold-out split.

In [20]:
df_test = pd.read_csv("datasets/testdata.csv", names=['sentiment', 'id', 'date', 'query', 'user', 'text'])

In [55]:
df_test = df_test.drop(df_test[df_test['sentiment'] == 2].index)

We are dropping all the rows that include the neutral sentiment (sentiment = 2) because there is no neutral sentiment in our training data.

In [58]:
X_test3 = df_test['text']
y_test3 = df_test['sentiment']
X_train3 = df['text']
y_train3 = df['sentiment']

In [59]:
vectoriser3 = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser3.fit(X_train3)
X_train3 = vectoriser3.transform(X_train3)
X_test3 = vectoriser3.transform(X_test3)

In [60]:
clf3 = BernoulliNB().fit(X_train3, y_train3)
y_pred3 = clf3.predict(X_test3)
print(classification_report(y_test3, y_pred3))

              precision    recall  f1-score   support

           0       0.84      0.81      0.83       177
           4       0.82      0.85      0.83       182

    accuracy                           0.83       359
   macro avg       0.83      0.83      0.83       359
weighted avg       0.83      0.83      0.83       359



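Accuracy on the hand-labeled test data is 0.83, compared with 0.80 when the test data is split off from our training data. Note that this test set contains only 359 tweets, and that this model was trained on all 1.6 million tweets rather than on 80% of them.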
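
To get a rough feel for how much weight a test set of 359 tweets can bear, here is a minimal sketch that computes an approximate 95% interval for an accuracy value. It assumes a simple normal approximation; the `accuracy_interval` helper is hypothetical and only meant as a sanity check.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples (normal approximation)."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

# hand-labeled test set: 359 tweets, accuracy 0.83
low, high = accuracy_interval(0.83, 359)
print(f"hand-labeled test set: {low:.3f} to {high:.3f}")

# hold-out split: 320000 tweets, accuracy 0.80
low_split, high_split = accuracy_interval(0.80, 320000)
print(f"hold-out split:        {low_split:.3f} to {high_split:.3f}")
```

Under this approximation the interval for the small test set is several percentage points wide in each direction, so the gain over the hold-out split should be read with some caution.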