The best cuts (thresholds) for predicting sentiment with TextBlob and Vader were estimated in the notebooks `textblob_tuning.ipynb` and `vader_tuning.ipynb` respectively. This notebook uses those cuts to predict sentiments and analyzes the predictions made by both models. It also discusses how subjectivity scores can be used to filter tweets by how subjective or objective they are.

### Table of Contents

- [Imports and Configurations](#imports-and-configurations)
- [Importing the Dataset](#importing-the-dataset)
- [Estimating Vader Compound Scores](#estimating-vader-compound-score)
- [Estimating TextBlob polarity and subjectivity](#estimating-textblob-polarity-and-subjectivity)
- [Predicting Sentiment using Vader](#predicting-sentiment-using-vader)
  - [Irrelevant Sentiments](#irrelevant-sentiments-vader)
- [Predicting Sentiments using TextBlob](#predicting-sentiments-using-textblob)
  - [Irrelevant Sentiments](#irrelevant-sentiments-text-blob)
- [Comparing Vader Compound Values with TextBlob Polarity](#comparing-vader-compound-values-with-textblob-polarity)
- [Working with Subjectivity](#working-with-subjectivity)
- [Conclusion](#conclusion)

### Imports and Configurations

In [1]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

In [2]:
import pyspark.pandas as ps
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

### Importing the Dataset

In [3]:
names = ["Tweet_ID","Entity","Sentiment","Tweet_Content"]
label = "Sentiment"

In [4]:
# The CSV files have no header row, so column names are supplied explicitly
pdf_train = pd.read_csv("./twitter_training.csv", names=names)
pdf_valid = pd.read_csv("./twitter_validation.csv", names=names)

In [5]:
df = ps.concat([
    ps.from_pandas(pdf_train),
    ps.from_pandas(pdf_valid)
])

In [6]:
df.dropna(inplace=True)
df.head()

                                                                                

|   | Tweet_ID | Entity | Sentiment | Tweet_Content |
|---|----------|--------|-----------|---------------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |


### Estimating Vader Compound Score

In [7]:
analyzer = SentimentIntensityAnalyzer()

In [8]:
def getVaderSentimentScore(tweet):
    result = analyzer.polarity_scores(tweet)
    return result["compound"]

In [9]:
# Note: df[label] is the `Sentiment` column, so the label text is what gets scored here
df["VADER_compound"] = df[label].apply(getVaderSentimentScore)


In [10]:
df.head()

                                                                                

|   | Tweet_ID | Entity | Sentiment | Tweet_Content | VADER_compound |
|---|----------|--------|-----------|---------------|----------------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... | 0.5574 |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... | 0.5574 |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... | 0.5574 |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... | 0.5574 |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... | 0.5574 |


As in the `data_analysis.ipynb` notebook, the following code checks whether the compound score estimated by Vader is the same for all edits of a tweet that share a single `Tweet_ID`.

In [11]:
# Pandas API on PySpark produces a lot of warning messages for the following line of code
# Hence output is collected in a variable and the output of this cell is collapsed
# The output is printed in the next cell
output = df.groupby(["Tweet_ID"])["VADER_compound"].unique().apply(len).ne(1).sum()


In [12]:
output

0

The output is 0, so all edits sharing a `Tweet_ID` received the same `compound` score from Vader.
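
The check above counts the `Tweet_ID` groups that contain more than one distinct score. A minimal pure-Python sketch of the same idea, using made-up toy rows rather than the real dataset (the function name `count_inconsistent_groups` is hypothetical):

```python
from collections import defaultdict


def count_inconsistent_groups(rows, key, value):
    """Count groups (by `key`) that hold more than one distinct `value`."""
    distinct = defaultdict(set)
    for row in rows:
        distinct[row[key]].add(row[value])
    return sum(1 for values in distinct.values() if len(values) != 1)


# Toy rows (assumed for illustration): the edits of tweet 1 agree, those of tweet 2 do not.
rows = [
    {"Tweet_ID": 1, "VADER_compound": 0.5574},
    {"Tweet_ID": 1, "VADER_compound": 0.5574},
    {"Tweet_ID": 2, "VADER_compound": 0.0},
    {"Tweet_ID": 2, "VADER_compound": -0.5719},
]
print(count_inconsistent_groups(rows, "Tweet_ID", "VADER_compound"))  # prints 1
```

A result of 0 corresponds to the notebook's output: no group has conflicting scores.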

### Estimating TextBlob polarity and subjectivity

In [13]:
def getTextBlobPolarity(line):
    analyzer = TextBlob(line)
    return analyzer.sentiment.polarity


In [14]:
# As in the Vader cell, df[label] is the `Sentiment` column, so the label text is scored
df["TEXTBLOB_polarity"] = df[label].apply(getTextBlobPolarity)


In [15]:
df.head()



|   | Tweet_ID | Entity | Sentiment | Tweet_Content | VADER_compound | TEXTBLOB_polarity |
|---|----------|--------|-----------|---------------|----------------|-------------------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... | 0.5574 | 0.227273 |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... | 0.5574 | 0.227273 |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... | 0.5574 | 0.227273 |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... | 0.5574 | 0.227273 |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... | 0.5574 | 0.227273 |


For polarity, all edits with the same `Tweet_ID` appear to have the same score. The following check confirms this.

In [16]:
output = df.groupby(["Tweet_ID"])["TEXTBLOB_polarity"].unique().apply(len).ne(1).sum()


In [17]:
output

0

In [18]:
def getTextBlobSubjectivity(line):
    analyzer = TextBlob(line)
    return analyzer.sentiment.subjectivity

In [19]:
# As in the Vader cell, df[label] is the `Sentiment` column, so the label text is scored
df["TEXTBLOB_subjectivity"] = df[label].apply(getTextBlobSubjectivity)


For subjectivity, all edits with the same `Tweet_ID` appear to have the same score. The following check confirms this.

In [20]:
output = df.groupby(["Tweet_ID"])["TEXTBLOB_subjectivity"].unique().apply(len).ne(1).sum()


In [21]:
output

0

### Predicting Sentiment using Vader

The following code differs from what was written in `vader_tuning.ipynb`: here, the thresholds used for predicting sentiment are hard-coded to the values tuned in that notebook.

In [22]:
def getVaderSentiment(score):
    if score>=0.5:
        return "Positive"
    elif score >=-0.5:
        return "Neutral"
    return "Negative"

In [23]:
df["VADER_sentiment"] = df["VADER_compound"].apply(getVaderSentiment)


In [24]:
df.head()

                                                                                

|   | Tweet_ID | Entity | Sentiment | Tweet_Content | VADER_compound | TEXTBLOB_polarity | TEXTBLOB_subjectivity | VADER_sentiment |
|---|----------|--------|-----------|---------------|----------------|-------------------|-----------------------|-----------------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... | 0.5574 | 0.227273 | 0.545455 | Positive |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... | 0.5574 | 0.227273 | 0.545455 | Positive |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... | 0.5574 | 0.227273 | 0.545455 | Positive |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... | 0.5574 | 0.227273 | 0.545455 | Positive |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... | 0.5574 | 0.227273 | 0.545455 | Positive |


#### Irrelevant Sentiments Vader

Checking accuracy for rows whose sentiment is not `Irrelevant`:

In [25]:
not_irrelevant = df[df[label]!="Irrelevant"]
output = ((not_irrelevant[label] == not_irrelevant["VADER_sentiment"]).sum())/len(not_irrelevant)

                                                                                

In [26]:
output

1.0

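Vader predicted every non-`Irrelevant` label correctly (accuracy 1.0). Note that `VADER_compound` was computed from `df[label]`, i.e. from the label text itself rather than from `Tweet_Content`, so this figure does not measure accuracy on the tweet text.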
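
The accuracy computed above is the share of matching predictions among rows whose label is not `Irrelevant`. A small pure-Python sketch of that calculation, on made-up toy rows (the helper name `accuracy_excluding` is an assumption, not part of the notebook):

```python
def accuracy_excluding(rows, truth, predicted, excluded):
    """Fraction of rows where `predicted` equals `truth`,
    ignoring rows whose `truth` value is `excluded`."""
    kept = [row for row in rows if row[truth] != excluded]
    if not kept:
        raise ValueError("no rows left after exclusion")
    correct = sum(row[truth] == row[predicted] for row in kept)
    return correct / len(kept)


# Toy rows (assumed for illustration): 2 of the 3 non-Irrelevant rows match.
rows = [
    {"Sentiment": "Positive", "VADER_sentiment": "Positive"},
    {"Sentiment": "Negative", "VADER_sentiment": "Neutral"},
    {"Sentiment": "Neutral", "VADER_sentiment": "Neutral"},
    {"Sentiment": "Irrelevant", "VADER_sentiment": "Neutral"},
]
print(accuracy_excluding(rows, "Sentiment", "VADER_sentiment", "Irrelevant"))
```

The `Irrelevant` row never enters the numerator or the denominator, which mirrors filtering `df` before comparing the two columns.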

### Predicting Sentiments using TextBlob

The same steps are repeated with TextBlob, using the cuts tuned in `textblob_tuning.ipynb`.

In [27]:
def getTextBlobSentiment(polarity):
    if polarity>=0.2:
        return "Positive"
    elif polarity >=-0.2:
        return "Neutral"
    return "Negative"
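
Both threshold functions follow the same pattern and differ only in the cut (0.5 for Vader, 0.2 for TextBlob). The sketch below expresses that shared pattern as one function; the name `classify` and the constants are hypothetical and only restate the hard-coded values used in this notebook.

```python
def classify(score, cut):
    """Map a score in [-1, 1] to a sentiment using a symmetric cut.

    score >= cut   -> "Positive"
    score >= -cut  -> "Neutral"
    otherwise      -> "Negative"
    """
    if score >= cut:
        return "Positive"
    if score >= -cut:
        return "Neutral"
    return "Negative"


VADER_CUT = 0.5     # cut hard-coded in getVaderSentiment
TEXTBLOB_CUT = 0.2  # cut hard-coded in getTextBlobSentiment

print(classify(0.5574, VADER_CUT))       # Positive
print(classify(-0.5, VADER_CUT))         # Neutral (the lower boundary is inclusive)
print(classify(0.227273, TEXTBLOB_CUT))  # Positive
print(classify(-0.5, TEXTBLOB_CUT))      # Negative
```

The same score can therefore land in different classes depending on the cut: -0.5 is `Neutral` under the Vader cut but `Negative` under the TextBlob cut.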

In [28]:
df["TEXTBLOB_sentiment"] = df["TEXTBLOB_polarity"].apply(getTextBlobSentiment)


#### Irrelevant Sentiments Text Blob

In [29]:
not_irrelevant = df[df[label]!="Irrelevant"]
output = ((not_irrelevant[label] == not_irrelevant["TEXTBLOB_sentiment"]).sum())/len(not_irrelevant)

                                                                                

In [30]:
output

1.0

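TextBlob also predicted every non-`Irrelevant` label correctly. Here too, `TEXTBLOB_polarity` was computed from `df[label]`, i.e. from the label text rather than from `Tweet_Content`, so this figure does not measure accuracy on the tweet text.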

In [31]:
df.head()

                                                                                

|   | Tweet_ID | Entity | Sentiment | Tweet_Content | VADER_compound | TEXTBLOB_polarity | TEXTBLOB_subjectivity | VADER_sentiment | TEXTBLOB_sentiment |
|---|----------|--------|-----------|---------------|----------------|-------------------|-----------------------|-----------------|--------------------|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... | 0.5574 | 0.227273 | 0.545455 | Positive | Positive |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... | 0.5574 | 0.227273 | 0.545455 | Positive | Positive |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... | 0.5574 | 0.227273 | 0.545455 | Positive | Positive |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... | 0.5574 | 0.227273 | 0.545455 | Positive | Positive |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... | 0.5574 | 0.227273 | 0.545455 | Positive | Positive |


### Comparing Vader Compound values with TextBlob polarity

In [32]:
df.plot.scatter(x="VADER_compound",y="TEXTBLOB_polarity")

                                                                                

From the scatter plot we can observe four unique values for `TEXTBLOB_polarity` and three unique values for `VADER_compound`.

Where `VADER_compound` is 0, `TEXTBLOB_polarity` takes two values: 0 and -0.5. This looks counterintuitive because the two values map to different sentiments, yet both Vader and TextBlob predicted all sentiments accurately for rows whose sentiment is not `Irrelevant`. The following cells check whether the disagreement is confined to `Irrelevant` rows.

In [33]:
df_differ = df[df["VADER_sentiment"]!= df["TEXTBLOB_sentiment"]]
length = len(df_differ)
only_irrelevant = (df_differ["Sentiment"].unique().to_list() == ["Irrelevant"])


`to_list` loads all data into the driver's memory. It should only be used if the resulting list is expected to be small.

                                                                                

In [34]:
print(f"Number of rows for which vader and textblob have different sentiments {length}")
print(f"Do these rows only contain the sentiment `Irrelevant`? {only_irrelevant}")

Number of rows for which vader and textblob have different sentiments 13047
Do these rows only contain the sentiment `Irrelevant`? True
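
The comparison above boils down to two questions: how many rows get different predictions, and which true labels those rows carry. A pure-Python sketch on made-up toy rows (the helper name `disagreement_summary` is hypothetical); it collects the labels in a set, so the result does not depend on the order in which labels appear:

```python
def disagreement_summary(rows, pred_a, pred_b, truth):
    """Return (number of rows where the two predictions differ,
    set of true labels found among those rows)."""
    differ = [row for row in rows if row[pred_a] != row[pred_b]]
    labels = {row[truth] for row in differ}
    return len(differ), labels


# Toy rows (assumed for illustration): the models disagree only on Irrelevant rows.
rows = [
    {"Sentiment": "Positive", "VADER_sentiment": "Positive", "TEXTBLOB_sentiment": "Positive"},
    {"Sentiment": "Irrelevant", "VADER_sentiment": "Neutral", "TEXTBLOB_sentiment": "Negative"},
    {"Sentiment": "Irrelevant", "VADER_sentiment": "Neutral", "TEXTBLOB_sentiment": "Negative"},
    {"Sentiment": "Neutral", "VADER_sentiment": "Neutral", "TEXTBLOB_sentiment": "Neutral"},
]
count, labels = disagreement_summary(rows, "VADER_sentiment", "TEXTBLOB_sentiment", "Sentiment")
print(count, labels == {"Irrelevant"})  # prints: 2 True
```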


### Working with Subjectivity

Subjectivity does not seem suited to predicting sentiment. Rather, it is a value an organization can use to filter tweets: an organization might want to view tweets, or their sentiments, according to how objective or subjective each tweet is.

Plotting a histogram to understand the subjectivity of tweets.

In [35]:
df["TEXTBLOB_subjectivity"].hist()

                                                                                

Subjectivity can be used to find tweets in which the user has expressed personal feelings, as shown below.

In [46]:
for tweet in df[df["TEXTBLOB_subjectivity"]>0.5].sample(frac=0.01).head()["Tweet_Content"].to_numpy():
    print(tweet)


`to_numpy` loads all data into the driver's memory. It should only be used if the resulting NumPy ndarray is expected to be small.



How the hell are we already into Halloween month?!.
@ UnitedHayze to have my efforts noticed; 4-way Borderlands Co-Op would be amazing! pic.facebook.com / vcgNcpMavu
My Herney friend is also a fantastic artist.
Friendly Dog<unk> all 3's audio is an absolute mess and any song is by is probably my favorite gag in the entire game.. youtu.be/nFwAmB-tluI
. LIVE NOW!.. I feel like I have not streamed in vehicle problem because I have, sorry about that, more Borderlands for a bit with @ eyesofness.... Borderlands...


                                                                                

Alternatively, tweets can be filtered by how objective they are, as shown below.

In [47]:
for tweet in df[df["TEXTBLOB_subjectivity"]<0.5].sample(frac=0.01).head()["Tweet_Content"].to_numpy():
    print(tweet)




I hate that this easy mayhem modifier event on mayhem won't last forever. this is the most fun i've had in the game since they added them horrible modifiers. @ Borderlands please give me the option to play mayhem 10 but turn the modifiers off PLS
Elie to direct 'The Frontier' film
<unk><unk> was directing its ‘Borderlands’ movie engadget.com/2020/02/20/eli...
Ok, so for every major advancement I see in Borderlands 3 there's a minor annoyance. Not bad per se but just there are things I hate. Like the characterization of side characters.
@Borderlands Can we please fix the lag, audio issues, and crashes on PS4 after the update. It is VERY hard to play with 10 fps or less. Don't want this limited Cartel to run out and not be able to play properly. Also, MENUS! Fix the MENU LAG! Please!


                                                                                

The cut for subjectivity can be lowered even further to keep only the more objective tweets.

In [48]:
for tweet in df[df["TEXTBLOB_subjectivity"]<0.1].sample(frac=0.01).head()["Tweet_Content"].to_numpy():
    print(tweet)




1 Morning~!!. I'm split on playing PSO2 or Borderlands 3 for todays stream. 🤔. Either way a stream today is happening and I'm excited to hang out nonetheless!!. Hope your weekend was well. .  :3 pic.twitter.com/X0CrvuK8Pq
That would be a fantastic casting, but somehow it doesn't alleviate my sense of dread a bit.
This would be an amazing casting, and yet somehow, it doesn't ease my impending sense of that one thing.
Xbox One Anointed Trained Ripper Consecutive Hits 1% LVL50 OP Borderlands 3 dlvr.it/RMdFZP  
When I have completed my combat pass and the officer rank is a challenge every season,


                                                                                

Objective tweets can also be filtered for a specific entity, as shown below.

In [53]:
entity = "Amazon"
df_entity = df[df["Entity"]==entity]
for tweet in df_entity[df_entity["TEXTBLOB_subjectivity"]<0.1].sample(frac=0.01).head()["Tweet_Content"].to_numpy():
    print(f"Tweet for Entity: {entity}")
    print(tweet)



Tweet for Entity: Amazon
Amazon's top-of-the-line Kindle is on sale for 30% more for a limited week. 1.today.com/3lMEXiR
Tweet for Entity: Amazon
Amazon has the newest, coolest Tile trackers on sale for up to 24% off dlvr.it/RRXj4Z
Tweet for Entity: Amazon
@Bladebattler92 Thanks for entering Grand Summoners  . . Watch the video to see if you won a $100 Amazon gift card! . Retweet everyday for another chance to win!. . . Play now for a FREE . 5 Yu Yu Hakusho Unit! .  https://t.co/z1TA7G1hbR
Tweet for Entity: Amazon
I played this interesting quiz on at Amazon - Try your luck enough for half a chance to personally win exciting rewards in amazon. Amazon in / game / share / g2H …
Tweet for Entity: Amazon
RT @ richardturrin: Amazon and Goldman partners. Perfect BaaS strategy!.. thefinancialbrand.com / 92681 / marcus-g.... @ BrettKing @ leimer @ psb _ dc @ ipfconline1 @ UrsBolt @ cgledhill @ rshevlin @ thepsironi @ karunk @ spirosmargaris @ jaypalter @ jimmarous @ efipm.


                                                                                

A histogram showing the subjectivity distribution for a particular entity can also be plotted, as shown below.

In [56]:
df_entity["TEXTBLOB_subjectivity"].hist(title=f"Subjectivity distribution for {entity}")

                                                                                

The same can be configured for any level of subjectivity just by modifying the cuts.
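
One way to make the cuts configurable is to wrap the filter in a small helper. The sketch below is pure Python on made-up rows; the function name, its parameters, and the subjectivity values are all assumptions for illustration, not TextBlob output. The strict comparisons mirror the `>` and `<` filters used in the cells above.

```python
def filter_by_subjectivity(rows, above=None, below=None, entity=None):
    """Return tweet texts whose subjectivity is strictly above `above`
    and strictly below `below`, optionally restricted to one entity."""
    selected = []
    for row in rows:
        score = row["TEXTBLOB_subjectivity"]
        if above is not None and not score > above:
            continue
        if below is not None and not score < below:
            continue
        if entity is not None and row["Entity"] != entity:
            continue
        selected.append(row["Tweet_Content"])
    return selected


# Toy rows with invented subjectivity scores (assumed for illustration).
rows = [
    {"Entity": "Amazon", "Tweet_Content": "Kindle on sale", "TEXTBLOB_subjectivity": 0.0},
    {"Entity": "Amazon", "Tweet_Content": "I love my Kindle", "TEXTBLOB_subjectivity": 0.6},
    {"Entity": "Borderlands", "Tweet_Content": "Patch released today", "TEXTBLOB_subjectivity": 0.05},
]
print(filter_by_subjectivity(rows, below=0.1, entity="Amazon"))  # ['Kindle on sale']
print(filter_by_subjectivity(rows, above=0.5))                   # ['I love my Kindle']
```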

### Conclusion

The Twitter sentiment analysis dataset was loaded into a pandas-on-Spark DataFrame. Sentiments were then predicted using the previously estimated cuts. The predicted sentiments were 100% accurate when rows labelled `Irrelevant` are excluded; note that the scores behind these predictions were computed from the `Sentiment` column (`df[label]`).

As discussed in the `data_analysis.ipynb` notebook, the scores estimated for the various edits of a tweet turned out to be the same.

TextBlob and Vader predicted the same sentiments except for tweets labelled `Irrelevant`, where their predictions differed (13047 rows).

Finally, subjectivity scores estimated by TextBlob were used to filter tweets for better analysis of subjective and objective tweets.