# Sentiment Analysis using Naive Bayes
In this casestudy, we will attempt to label tweets with sentiments (positive, neutral and negative) using Naive Bayes classifier, which has already been studied by you. Naive Bayes is a very basic approach to this problem, but gives surprisingly good accuracy sometimes. There are several elegant libraries for this problem, one of which will be briefly introduced in this notebook later. <br> 
<br> 
**Note:** Since Naive Bayes is a basic algorithm, a couple of very useful sklearn features have been incorporated in this assignment. They will help you write much more robust and clean code, and are applicable to any ML code you write. <br> 
References:
1. Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
2. GridSearch: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

# Part 1 - Naive Bayes 

## 1. Importing required libraries 

In [1]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import nltk
from sklearn.naive_bayes import GaussianNB

# import pipeline, CountVectorizer, TfidfTransformer, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV

## 2. Reading dataset 

In [2]:
data=pd.read_csv('tweets.csv')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()
data = data.dropna()
data = data.reset_index(drop=True)

In [3]:
data

Unnamed: 0,tweets,labels
0,Obama has called the GOP budget social Darwini...,1
1,"In his teen years, Obama has been known to use...",0
2,IPA Congratulates President Barack Obama for L...,0
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,RT @wardollarshome: Obama has approved more ta...,1
5,Video shows federal officials joking about cos...,0
6,"one Chicago kid who says ""Obama is my man"" tel...",0
7,"RT @ohgirlphrase: American kid ""You're from th...",0
8,A valid explanation for why Obama won't let wo...,1
9,President Obama &lt; Lindsay Lohan RUMORS begi...,0


## 3. Text processing for the tweets

In [4]:
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 

stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
    
def processTweet1 (tweet):
    # tweet is the text we will pass for preprocessing 
    
    # convert passed tweet to lower case 
    tweet = tweet.lower ()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # remove URLs
    tweet = re.sub('@[^\s]+', 'AT_USER', tweet) # remove usernames
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
    
    # use work_tokenize imported above to tokenize the tweet 
    tweet = word_tokenize(tweet)
    
    return [word for word in tweet if word not in stopwords]

In [5]:
def processTweet (tweet):
    tweet1 = ""
    tweet2 = ""
    tweet3 = ""
    tweet = tweet.lower ()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', ' ', tweet) # remove URLs
    tweet = re.sub('@[^\s]+', ' ', tweet) # remove usernames
    tweet = re.sub(r'#([^\s]+)', r' ', tweet) # remove the # in #hashtag   

    for char in tweet:
        if char not in punctuation:
            tweet1 = tweet1 + char    

    tweet2 = tweet1.split ()
   
    # Remove stopwords
    for word in tweet2:
        if word not in stopwords:
            tweet3 = (tweet1 + word)
            tweet3 = (tweet1 + " ")
    return (tweet3)

### Process all tweets 

In [6]:
data = data.dropna()
data = data.reset_index(drop=True)

processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned = processTweet (tweet)
    processed.append(cleaned)

In [7]:
print (processed)



In [8]:
data['processed']=processed

In [9]:
data

Unnamed: 0,tweets,labels,processed
0,Obama has called the GOP budget social Darwini...,1,obama has called the gop budget social darwini...
1,"In his teen years, Obama has been known to use...",0,in his teen years obama has been known to use ...
2,IPA Congratulates President Barack Obama for L...,0,ipa congratulates president barack obama for l...
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0,rt his connection to supporters of critic...
4,RT @wardollarshome: Obama has approved more ta...,1,rt obama has approved more targeted assassin...
5,Video shows federal officials joking about cos...,0,video shows federal officials joking about cos...
6,"one Chicago kid who says ""Obama is my man"" tel...",0,one chicago kid who says obama is my man tells...
7,"RT @ohgirlphrase: American kid ""You're from th...",0,rt american kid youre from the uk ohhh cool ...
8,A valid explanation for why Obama won't let wo...,1,a valid explanation for why obama wont let wom...
9,President Obama &lt; Lindsay Lohan RUMORS begi...,0,president obama lt lindsay lohan rumors beginn...


## 4. Create pipeline and define parameters for GridSearch

In [10]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

## 5. Split data into test and train

In [11]:
# split data into train and test with split as 0.2 

X, y = data.processed, data.labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
X_train

451                           for 10 id prank call obama 
532         gun sales booming doomsday obama or zombie...
243          exposing the obamasoetoro deception   via   
370     yes   rt   this   hashtag is entertaining is t...
1355    i bet obama didnt think about calculus when he...
1217    rt   that was one of the strangest days ever w...
1304    rt   spike lee said he dislikes blackwhite cou...
462     rt   obama says   passed by strong majority of...
1240    to obama legal precedents are all about politi...
54      barack obama longboard package core 7 trucks 7...
585     rt   i am worried that barack obama is in bed ...
1322    despite rising gas prices energy policy expert...
109     rt   arianna huffington blasting president oba...
923     rt   photo of the day first lady michelle obam...
566         would not have to come out and defend wome...
1058      that was paul ryans budget how did obamas bu...
1050      lol rt   mannnnn hope dey pass dat gas cap t...
844     rt   o

## 6. Perform classification (using GridSearch) 

In [13]:
grid = GridSearchCV(text_clf, cv=10, n_jobs=-1, param_grid=tuned_parameters)

In [14]:
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2), (2, 2)], 'tfidf__use_idf': (True, False), 'tfidf__norm': ('l1', 'l2'), 'clf__alpha': [1, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
print ("Score = %3.2f" %(grid.score(X_test, y_test)))

Score = 0.83


In [16]:
print (grid.best_params_)

{'clf__alpha': 0.1, 'tfidf__norm': 'l2', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


### Classification report 

In [17]:
# print classification report after predicting on test set with best model obtained in GridSearch
y_pred = grid.predict(X_test)
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.87      0.93      0.90       195
     class 1       0.67      0.67      0.67        61
     class 2       1.00      0.26      0.42        19

   micro avg       0.83      0.83      0.83       275
   macro avg       0.85      0.62      0.66       275
weighted avg       0.83      0.83      0.81       275



## Important and interesting insight:

In [18]:
counts = data.labels.value_counts()
print(counts)

0    942
1    352
2     81
Name: labels, dtype: int64


We can see above that the class distribution is highly imbalanced, this would not lead to good sampling of the data for the classifier. For your learning, you could use SMOTE (https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes and then evaluate the performance with Naive Bayes and compare. 

# Part 2 - VADER sentiment analysis

**Valence Aware Dictionary and Sentiment Reasoner (VADER)** is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER does not requires any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon. (A sentiment lexicon is a list of lexical features e.g., words, which are generally labelled according to their semantic orientation as either positive or negative.). VADER has been found to be quite successful when dealing with social media texts, editorials, movie reviews, and product reviews. This is because VADER not only tells about the Positivity and Negativity score but also tells us about how positive or negative a sentiment is.

[Original Paper](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) <br> 
<br> 
( Install the library using `pip install vaderSentiment`)

In [19]:
!pip install vaderSentiment





In [20]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [21]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{} {}".format(sentence, str(score)))

Let's see how it performs on a custom sentence

In [22]:
sentiment_analyzer_scores("VADER is smart, handsome, and funny.")

VADER is smart, handsome, and funny. {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}


1. The Positive, Negative and Neutral scores represent the proportion of text that falls in these categories. This means our sentence was rated as 75% Positive, 25% Neutral and 0% Negative. Hence all these should add up to 1.

2. The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate. 

        positive sentiment: compound score >= 0.05
        neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
        negative sentiment: compound score <= -0.05

### 1. Punctuation

The use of an exclamation mark(!), increases the magnitude of the intensity without modifying the semantic orientation. For example, “The food here is good!” is more intense than “The food here is good.” and an increase in the number of (!), increases the magnitude accordingly.

In [23]:
# Baseline sentence
sentiment_analyzer_scores("The food here is good")

The food here is good {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.4404}


In [24]:
# Punctuation
print(sentiment_analyzer_scores("The food here is good!"))
print(sentiment_analyzer_scores("The food here is good!!"))
print(sentiment_analyzer_scores("The food here is good!!!"))

The food here is good! {'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.4926}
None
The food here is good!! {'neg': 0.0, 'neu': 0.534, 'pos': 0.466, 'compound': 0.5399}
None
The food here is good!!! {'neg': 0.0, 'neu': 0.514, 'pos': 0.486, 'compound': 0.5826}
None


### 2. Capitalization
Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words, increases the magnitude of the sentiment intensity. For example, “The food here is GREAT!” conveys more intensity than “The food here is great!”

In [25]:
# Baseline sentence
sentiment_analyzer_scores("The food here is great!")

The food here is great! {'neg': 0.0, 'neu': 0.477, 'pos': 0.523, 'compound': 0.6588}


In [26]:
# Capitalisation
sentiment_analyzer_scores("The food here is GREAT!")

The food here is GREAT! {'neg': 0.0, 'neu': 0.438, 'pos': 0.562, 'compound': 0.729}


### 3. Conjunctions
Use of conjunctions like “but”, signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The food here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating.

In [27]:
# Baseline sentence
sentiment_analyzer_scores("The food here is great")

The food here is great {'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}


In [28]:
# Conjunctions
sentiment_analyzer_scores("The food here is great, but the service is horrible")

The food here is great, but the service is horrible {'neg': 0.31, 'neu': 0.523, 'pos': 0.167, 'compound': -0.4939}
