__Labeling Tweets__

To build a target variable column, i.e., to label the sentiment of each tweet, we will compare several widely available sentiment analysis tools. In the remaining sections, we compare and discuss the classification results of five well-known NLP tools in Python: VADER, TextBlob, the SentiWordNet lexicon from NLTK, StanfordCoreNLP and AFINN.
Each tool's output is mapped to one of three labels: -1 (negative), 0 (neutral) and +1 (positive). VADER, for example, reports positive, negative and neutral scores that give the proportion of the text falling into each category, plus a compound score normalized between -1 (most negative) and +1 (most positive).

In [16]:
# import the necessary modules
import numpy as np
import pandas as pd

In [17]:
# import the csv file as Pandas dataframe
df = pd.read_csv("@tweets13.csv")

In [18]:
# shape of DataFrame
df.shape

(13915, 3)

In [19]:
# Drop all rows with missing values
df = df.dropna()


In [20]:
# Convert the 'Date of Tweet' column to pandas datetime (datetime64[ns])
df['Date of Tweet'] = pd.to_datetime(df['Date of Tweet'])

# Print the column to see the new format
print(df['Date of Tweet'].head())

# Set 'Date of Tweet' as the index of df
df.set_index('Date of Tweet', inplace = True)

0   2013-08-09 18:06:33
1   2010-09-02 06:23:41
2   2010-03-19 20:44:09
3   2017-01-14 18:40:51
4   2007-06-25 13:17:17
Name: Date of Tweet, dtype: datetime64[ns]


In [21]:
df['clean_text'] = df['clean_text'].astype(str)
df.head()

Date of Tweet,year,clean_text
2013-08-09 18:06:33,2013,Worked with Lila and Nate today at BGHS Online...
2010-09-02 06:23:41,2010,We did vote if you recall Hillary won the popu...
2010-03-19 20:44:09,2010,Remember Impeachment is just as much pa of the...
2017-01-14 18:40:51,2017,Funny how Republicans like Nikki Haley suddenl...
2007-06-25 13:17:17,2007,Is this the Electoral College vote


__1. SentiWordNet lexicon:__

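First, let's label the tweets as positive, negative or neutral using the SentiWordNet lexicon. SentiWordNet assigns each WordNet synset a positivity score and a negativity score, each between 0 and 1. Entries are identified by lemma and part of speech (lemma#PoS) and are aligned with the WordNet lists of adjectives, nouns, verbs and adverbs. In the cell below, each word is looked up by its first sense (`word.tag.01`), the positive and negative scores are summed over the tweet, and the tweet is labeled 1 if the positive total is larger, -1 if the negative total is larger or the two totals tie, and 0 if both totals are zero.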
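
A minimal sketch of the labeling rule used in the next cell, assuming the per-word scores have already been looked up. The `label_from_scores` helper and the example scores are hypothetical and are not taken from SentiWordNet; only the decision rule mirrors the notebook.

```python
def label_from_scores(word_scores):
    """Sum (pos, neg) pairs and return (pos_total, neg_total, label).

    Mirrors the rule in pos_senti: 0 when both totals are zero,
    1 when the positive total is larger, -1 otherwise (ties included).
    """
    pos_total = sum(pos for pos, _ in word_scores)
    neg_total = sum(neg for _, neg in word_scores)
    if pos_total == 0 and neg_total == 0:
        label = 0
    elif pos_total > neg_total:
        label = 1
    else:
        label = -1
    return pos_total, neg_total, label


# Hypothetical per-word (pos, neg) scores, for illustration only
example = [(0.25, 0.0), (0.0, 0.125), (0.0, 0.0)]
print(label_from_scores(example))
```

Note that a tweet whose positive and negative totals are equal and non-zero ends up labeled -1 under this rule; treating such ties as neutral would be a different design choice.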

In [22]:
import time
import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.tag import pos_tag, map_tag
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize

pstem = PorterStemmer()
lem = WordNetLemmatizer()

df_copy = df.copy()
df_copy = df_copy.reset_index()
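def pos_senti(df_copy):  # takes a DataFrame with a 'clean_text' column; adds pos_score, neg_score and sent_score columns in place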
    li_swn=[]
    li_swn_pos=[]
    li_swn_neg=[]
    missing_words=[]
    for i in range(len(df_copy.index)):
        text = df_copy.loc[i]['clean_text']
        tokens = word_tokenize(str(text))
        tagged_sent = pos_tag(tokens)
        store_it = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in tagged_sent]
        #print("Tagged Parts of Speech:",store_it)

        pos_total=0
        neg_total=0
        for word,tag in store_it:
            if(tag=='NOUN'):
                tag='n'
            elif(tag=='VERB'):
                tag='v'
            elif(tag=='ADJ'):
                tag='a'
            elif(tag=='ADV'):
                tag = 'r'
            else:
                tag='nothing'

            if(tag!='nothing'):
                concat = word+'.'+tag+'.01'
                try:
                    this_word_pos=swn.senti_synset(concat).pos_score()
                    this_word_neg=swn.senti_synset(concat).neg_score()
                    #print(word,tag,':',this_word_pos,this_word_neg)
                except Exception as e:
                    wor = lem.lemmatize(word)
                    concat = wor+'.'+tag+'.01'
                    # Check whether the lemmatized word is found in the SWN corpus
                    try:
                        this_word_pos=swn.senti_synset(concat).pos_score()
                        this_word_neg=swn.senti_synset(concat).neg_score()
                    except Exception as e:
                        wor = pstem.stem(word)
                        concat = wor+'.'+tag+'.01'
                        # Check whether the stemmed word is found in the SWN corpus
                        try:
                            this_word_pos=swn.senti_synset(concat).pos_score()
                            this_word_neg=swn.senti_synset(concat).neg_score()
                        except:
                            missing_words.append(word)
                            continue
                pos_total+=this_word_pos
                neg_total+=this_word_neg
        li_swn_pos.append(pos_total)
        li_swn_neg.append(neg_total)

        if(pos_total!=0 or neg_total!=0):
            if(pos_total>neg_total):
                li_swn.append(1)
            else:
                li_swn.append(-1)
        else:
            li_swn.append(0)
    df_copy.insert(2,"pos_score",li_swn_pos,True)
    df_copy.insert(3,"neg_score",li_swn_neg,True)
    df_copy.insert(4,"sent_score",li_swn,True)
    return df_copy
    # end-of pos-tagging&sentiment
df3 = pos_senti(df_copy)

In [24]:
#counts of unique positive, negative and neutral values
df3.sent_score.value_counts()

 1    6785
-1    5264
 0    1865
Name: sent_score, dtype: int64
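
The chain of `if`/`elif` branches in `pos_senti` can also be written as a dictionary lookup. The sketch below is one possible refactoring, not the code that produced the counts above; `UNIVERSAL_TO_WORDNET` and `first_sense_id` are hypothetical names.

```python
# Map universal POS tags to the single-letter WordNet tags used in synset IDs
UNIVERSAL_TO_WORDNET = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def first_sense_id(word, universal_tag):
    """Return the ID of the word's first sense (e.g. 'good.a.01'),
    or None when the tag has no WordNet counterpart."""
    wn_tag = UNIVERSAL_TO_WORDNET.get(universal_tag)
    if wn_tag is None:
        return None
    return f"{word}.{wn_tag}.01"


print(first_sense_id("good", "ADJ"))
print(first_sense_id("the", "DET"))
```

Words whose tag is not a noun, verb, adjective or adverb yield `None` and can be skipped before any lexicon lookup is attempted.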

__2. AFINN:__

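AFINN is a list of English words that Finn Årup Nielsen manually rated for valence between 2009 and 2011, each with an integer between minus five (negative) and plus five (positive) [5]. The score of a text is the sum of the valences of the words it contains, so a tweet is labeled positive if the sum is above zero, negative if it is below zero, and neutral otherwise.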
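
To illustrate how such a word list yields a score for a whole text, here is a minimal sketch that sums integer valences over the words of a sentence. The tiny `toy_lexicon` and its values are made up for illustration and are not the actual AFINN ratings; the real scoring is done by the `afinn` package in the next cell.

```python
def valence_sum(text, lexicon):
    """Sum the valence of every whitespace-separated word found in the lexicon."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


def sign_label(score):
    """Map a numeric score to 1 (positive), -1 (negative) or 0 (neutral)."""
    return (score > 0) - (score < 0)


# Hypothetical valences on the -5..+5 scale, for illustration only
toy_lexicon = {"great": 3, "funny": 2, "bad": -3}

score = valence_sum("Great conversations today", toy_lexicon)
print(score, sign_label(score))
```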

In [386]:
# Afinn sentiment LABELING
from afinn import Afinn
af = Afinn()
count_total=0
count_pos=0
count_neg=0
count_neut=0
li_af = []
for i in range(len(df_copy.index)):
    sent = str(df_copy.loc[i]['clean_text'])
    score = af.score(sent)  # score each tweet once
    count_total=count_total+1
    if(score>0):
        count_pos=count_pos+1
        li_af.append(1)
    elif(score<0):
        count_neg=count_neg+1
        li_af.append(-1)
    else:
        count_neut=count_neut+1
        li_af.append(0)

print("Total tweets:",len(df_copy.index))
print("Total tweets with sentiment:",count_total)
print("positive tweets:",count_pos)
print("negative tweets:",count_neg)
print("neutral tweets:",count_neut)

Total tweets: 13914
Total tweets with sentiment: 13914
positive tweets: 5615
negative tweets: 3416
neutral tweets: 4883
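
Once the labels are collected in a list such as `li_af`, the separate counters can be replaced by `collections.Counter`. This is a sketch with a short made-up label list; `count_labels` is a hypothetical helper.

```python
from collections import Counter


def count_labels(labels):
    """Count how many labels are positive (1), negative (-1) and neutral (0)."""
    counts = Counter(labels)
    return {"positive": counts[1], "negative": counts[-1], "neutral": counts[0]}


# Made-up labels for illustration; in the notebook this would be li_af
labels = [1, -1, 0, 1, 1, 0]
print(count_labels(labels))
```

The same helper would work unchanged on the TextBlob labels collected in a later cell.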


__3. TextBlob:__

TextBlob is a popular Python library for processing textual data, built on top of NLTK, another popular natural language processing toolkit for Python. TextBlob uses a sentiment lexicon of predefined words, each annotated with a polarity, a subjectivity and an intensity, and combines the word scores through a weighted average into an overall sentiment for the text. The resulting polarity ranges from -1 (negative) to +1 (positive), and its sign is what we use to label each tweet.

In [387]:
#TextBlob SENTIMENT LABELING
from textblob import TextBlob
count_total=0
count_pos=0
count_neg=0
count_neut=0

li_tb = []
for i in range(len(df_copy.index)):
    polarity = TextBlob(str(df_copy.loc[i]["clean_text"])).sentiment.polarity
    count_total=count_total+1
    if(polarity>0):
        count_pos=count_pos+1
        li_tb.append(1)
    elif(polarity<0):
        count_neg=count_neg+1
        li_tb.append(-1)
    else:
        count_neut=count_neut+1
        li_tb.append(0)

print("Total tweets:",len(df_copy.index))
print("Total tweets with sentiment:",count_total)
print("positive tweets:",count_pos)
print("negative tweets:",count_neg)
print("neutral tweets:",count_neut)

Total tweets: 13914
Total tweets with sentiment: 13914
positive tweets: 5820
negative tweets: 2291
neutral tweets: 5803


__4. VADER:__

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. Once VADER is installed, a SentimentIntensityAnalyzer object is used to score the texts as below, and each tweet is labeled by the sign of its compound score:

In [25]:
# Load SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Work on a copy of the DataFrame
df6 = df.copy()

# Instantiate new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Generate sentiment scores
sentiment_scores = df6['clean_text'].apply(sid.polarity_scores)

In [26]:
df6["score"] = sentiment_scores.apply(lambda x: x['compound'])

In [27]:
# Count positive, negative and neutral tweets by the sign of the compound score

count_total=0
count_pos=0
count_neg=0
count_neut=0


for i in df6["score"]:
    if i >0:
        count_pos=count_pos+1
    elif i <0:
        count_neg = count_neg +1
    else:
        count_neut = count_neut +1

print("positive tweets:",count_pos)
print("negative tweets:",count_neg)
print("neutral tweets:",count_neut)

# Label each tweet 1, -1 or 0 by the sign of its compound score
conditions = [
    (df6['score'] >0),
    (df6['score'] <0),
    (df6['score'] == 0)]
choices = [1,-1,0]
df6['sentiment'] = np.select(conditions, choices )


positive tweets: 6123
negative tweets: 3813
neutral tweets: 3978
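
The cell above treats any non-zero compound score as positive or negative. The VADER documentation suggests a neutral band instead, with scores of at least 0.05 counted as positive and scores of at most -0.05 counted as negative. A sketch of that variant follows; `label_compound` and its `threshold` parameter are hypothetical names, and using it would change the counts reported above.

```python
def label_compound(compound, threshold=0.05):
    """Label a VADER compound score with a neutral band around zero."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0


# Example compound scores; the near-zero value is made up for illustration
for value in (0.6249, 0.0, -0.02):
    print(value, label_compound(value))
```

Applied to the notebook's data, this could be written as `df6['score'].apply(label_compound)`.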


In [28]:
# print few rows

df6.head()

Date of Tweet,year,clean_text,score,sentiment
2013-08-09 18:06:33,2013,Worked with Lila and Nate today at BGHS Online...,0.6249,1
2010-09-02 06:23:41,2010,We did vote if you recall Hillary won the popu...,0.7579,1
2010-03-19 20:44:09,2010,Remember Impeachment is just as much pa of the...,0.0,0
2017-01-14 18:40:51,2017,Funny how Republicans like Nikki Haley suddenl...,0.7766,1
2007-06-25 13:17:17,2007,Is this the Electoral College vote,0.0,0


__5. StanfordCoreNLP:__
    
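Unlike the lexicon-based tools above, the StanfordCoreNLP sentiment annotator builds on the grammatical structure of a sentence: it parses each sentence and predicts one of five classes (very negative, negative, neutral, positive, very positive) from the parse tree. The code below assumes that a CoreNLP server is running at http://localhost:9000.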

In [415]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

df8 = df.copy()
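def get_sentiment(text):
    # Query the CoreNLP server and return the sentiment label of the first sentence only
    res = nlp.annotate(text,
                       properties={'annotators': 'sentiment,tokenize,ssplit',
                                   'outputFormat': 'json',
                                   'timeout': 1000,  # milliseconds
                       })
    return res['sentences'][0]['sentiment']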



In [416]:
text_amb = "We did vote if you recall Hillary won the popular vote by over million votes"
get_sentiment(text_amb)

'Negative'
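
`get_sentiment` returns the label of the first sentence only. If a tweet is split into several sentences, one option is to average the per-sentence values. The sketch below works on the parsed JSON response and assumes that each sentence carries a `sentimentValue` field (0 = very negative to 4 = very positive) next to `sentiment`; the `mock_response` dictionary is made up, so check the field names against your server's actual output.

```python
def mean_sentiment_value(response):
    """Average the numeric sentiment over all sentences of a CoreNLP JSON response."""
    values = [int(sentence["sentimentValue"]) for sentence in response["sentences"]]
    if not values:
        return None
    return sum(values) / len(values)


# Made-up response with two sentences, shaped like the server's JSON output
mock_response = {
    "sentences": [
        {"sentiment": "Negative", "sentimentValue": "1"},
        {"sentiment": "Positive", "sentimentValue": "3"},
    ]
}
print(mean_sentiment_value(mock_response))
```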

In [417]:
df8["sentiment"] =df8['clean_text'].map(get_sentiment)

In [418]:
pd.set_option('display.max_colwidth', None)  # show the full text; pandas versions before 1.0 expect -1 here
df8.head(2)

Date of Tweet,year,clean_text,sentiment
2013-08-09 18:06:33,2013,Worked with Lila and Nate today at BGHS Online Lab Great conversations regarding Electoral College and life after high,Negative
2010-09-02 06:23:41,2010,We did vote if you recall Hillary won the popular vote by over million votes,Negative


In [419]:
df8.sentiment.value_counts( )

Negative        7650
Neutral         4941
Positive        1300
Verynegative    20  
Verypositive    3   
Name: sentiment, dtype: int64

In [421]:
df8["clean_text"].iloc[1]

'We did vote if you recall Hillary won the popular vote by over million votes'

In [422]:
df.dtypes

year          int64 
clean_text    object
dtype: object

In [118]:
conditions = [
    (df8['sentiment'] == 'Negative'),
    (df8['sentiment'] == 'Verynegative'),
    (df8['sentiment']== "Positive"),
    (df8['sentiment'] == "Verypositive" ),
    (df8['sentiment']== "Neutral")
]

choices = [-1,-1,1,1,0]
df8['sentiment_val'] = np.select(conditions, choices )
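
The same mapping can be expressed as a dictionary, which keeps each label next to its value. A sketch, with `LABEL_TO_VALUE` and `to_value` as hypothetical names:

```python
# Five CoreNLP classes collapsed to three numeric labels
LABEL_TO_VALUE = {
    "Verynegative": -1,
    "Negative": -1,
    "Neutral": 0,
    "Positive": 1,
    "Verypositive": 1,
}


def to_value(label):
    """Return -1, 0 or 1 for a CoreNLP label; raise KeyError on an unknown label."""
    return LABEL_TO_VALUE[label]


print([to_value(label) for label in ("Negative", "Neutral", "Verypositive")])
```

With pandas, `df8['sentiment'].map(LABEL_TO_VALUE)` applies the dictionary to the whole column. Unlike `np.select`, which falls back to 0 when no condition matches, an unknown label then shows up as NaN instead of being silently counted as neutral.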


Now, let's compare the sentiment analyzer tools on the same tweet (row 3):

In [119]:
print("StanfordCoreNLP:")
print(df8.iloc[3])

StanfordCoreNLP:
Date of Tweet    2017-01-14 18:40:51                                                                                               
year             2017                                                                                                              
clean_text       Funny how Republicans like Nikki Haley suddenly have all this energy for letting the people decide Trump fate when
sentiment        Negative                                                                                                          
sentiment_val    -1                                                                                                                
Name: 3, dtype: object


In [120]:
print("Vader:")
print(df6.iloc[3])

Vader:
year          2017                                                                                                              
clean_text    Funny how Republicans like Nikki Haley suddenly have all this energy for letting the people decide Trump fate when
score         0.7766                                                                                                            
sentiment     1                                                                                                                 
Name: 2017-01-14 18:40:51, dtype: object


In [121]:
print("SentiWordNet lexicon:")
print(df3.iloc[3])

SentiWordNet lexicon:
Date of Tweet    2017-01-14 18:40:51
year             2017
pos_score        0.25
neg_score        0
sent_score       1
clean_text       Funny how Republicans like Nikki Haley suddenly have all this energy for letting the people decide Trump fate when
Name: 3, dtype: object


In [123]:
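print("TextBlob:")
# Note: df_copy carries the SentiWordNet columns added in place by pos_senti;
# the TextBlob label for this row is stored in li_tb[3]
print(df_copy.iloc[3])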

TextBlob:
Date of Tweet    2017-01-14 18:40:51                                                                                               
year             2017                                                                                                              
pos_score        0.25                                                                                                              
neg_score        0                                                                                                                 
sent_score       1                                                                                                                 
clean_text       Funny how Republicans like Nikki Haley suddenly have all this energy for letting the people decide Trump fate when
Name: 3, dtype: object


In [124]:
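print("Afinn:")
# Note: df3 holds the SentiWordNet scores returned by pos_senti;
# the AFINN label for this row is stored in li_af[3]
print(df3.iloc[3])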

Afinn:
Date of Tweet    2017-01-14 18:40:51                                                                                               
year             2017                                                                                                              
pos_score        0.25                                                                                                              
neg_score        0                                                                                                                 
sent_score       1                                                                                                                 
clean_text       Funny how Republicans like Nikki Haley suddenly have all this energy for letting the people decide Trump fate when
Name: 3, dtype: object
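
Comparing a single row shows where the tools disagree on one tweet; an agreement rate over all tweets gives a broader picture. Below is a sketch with made-up label lists. In the notebook, the inputs could be `li_af`, `li_tb` and the label columns `df3['sent_score']`, `df6['sentiment']` and `df8['sentiment_val']` converted to lists. `agreement_rate` is a hypothetical helper.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two equally long label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if len(labels_a) == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for four tweets, for illustration only
tools = {
    "afinn": [1, -1, 0, 1],
    "textblob": [1, 0, 0, 1],
    "vader": [1, -1, 0, -1],
}

# Print the agreement rate for every pair of tools
for (name_a, a), (name_b, b) in combinations(tools.items(), 2):
    print(name_a, name_b, agreement_rate(a, b))
```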


In [425]:
#df8.to_csv("@tweets_final.csv")

__Conclusion:__

We will use StanfordCoreNLP to label our tweets dataset because its sentiment model takes the structure of the sentence into account, rather than just looking at individual words in isolation.