The first step in building any machine learning model is to analyze the dataset involved. This notebook analyzes the Twitter dataset used for sentiment analysis.

### Table of Contents

- [Imports and Configurations](#imports-and-configurations)
- [Importing the Dataset](#importing-the-dataset)
- [Analyzing Repeated Tweet IDs](#analyzing-repeated-tweet-ids)
- [Analyzing Sentiments](#analyzing-sentiments)
- [Analyzing Entities](#analyzing-entities)
- [Sentiment Analysis](#sentiment-analysis)
- [Understanding Irrelevant Sentiments](#understanding-irrelevant-sentiments)
- [Missing Values](#missing-values)
- [Summary](#summary)

### Imports and Configurations 

In [1]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

In [2]:
import pyspark.pandas as ps
import pandas as pd

In [3]:
# VSCode was used for developing and testing this notebook.
# The following code is needed for Plotly figures to render in VSCode.
# Comment out the following lines when running in Jupyter Notebook, if necessary.
import plotly.io as pio
pio.renderers.default = "vscode"

### Importing the Dataset 

In [4]:
names = ["Tweet_ID","Entity","Sentiment","Tweet_Content"]
label = "Sentiment"

In [5]:
pdf_train = pd.read_table("./twitter_training.csv",names=names,index_col="Tweet_ID",sep=",")
pdf_valid = pd.read_table("./twitter_validation.csv",names=names,index_col="Tweet_ID",sep=",")

In [6]:
# The following code is supposed to suppress warnings, but it doesn't seem to work

# from pyspark import SparkContext
# sc = SparkContext()

# sc.setLogLevel("OFF")

In [7]:
df = ps.concat([
    ps.from_pandas(pdf_train),
    ps.from_pandas(pdf_valid)
])


iteritems is deprecated and will be removed in a future version. Use .items instead.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/12 12:49:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable






In [8]:
df.head()

                                                                                

Tweet_ID,Entity,Sentiment,Tweet_Content
2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2401,Borderlands,Positive,im coming on borderlands and i will murder you...
2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [9]:
print("Total Records", len(df))



Total Records 75682


                                                                                

### Analyzing Repeated Tweet IDs

When loading a dataset from a CSV file, it is good practice to use an ID column as the index, which is what the code above does. IDs are normally unique, but the values under `Tweet_ID` repeat. This raises two questions:

1. Why are there repeated values?
2. Can `Tweet_ID` be used as the index?

In [10]:
counts = df.index.value_counts()
print(counts.value_counts())
counts.value_counts().plot.pie()

                                                                                

6    11447
7     1000
Name: Tweet_ID, dtype: int64


                                                                                

Every `Tweet_ID` appears either 6 times (11,447 IDs) or 7 times (1,000 IDs).
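The same "rows per ID, then IDs per row count" computation can be sketched in plain pandas on a toy frame. The data below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: three hypothetical Tweet_IDs with 2, 2 and 3 rows.
toy = pd.DataFrame(
    {"Tweet_Content": ["a", "a!", "b", "b?", "c", "c.", "c.."]},
    index=pd.Index([1, 1, 2, 2, 3, 3, 3], name="Tweet_ID"),
)

# Number of rows stored under each Tweet_ID.
rows_per_id = toy.groupby(level="Tweet_ID").size()

# How many IDs share each row count (the quantity shown in the pie chart).
size_distribution = rows_per_id.value_counts().sort_index()
print(size_distribution.to_dict())
```

Here two IDs have 2 rows and one ID has 3 rows, mirroring the 6-versus-7 split seen in the real data.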

In [11]:
for tweet in df["Tweet_Content"][2401].to_numpy():
    print(tweet)


`to_numpy` loads all data into the driver's memory. It should only be used if the resulting NumPy ndarray is expected to be small.



im getting on borderlands and i will murder you all ,
I am coming to the borders and I will kill you all,
im getting on borderlands and i will kill you all,
im coming on borderlands and i will murder you all,
im getting on borderlands 2 and i will murder you me all,
im getting into borderlands and i can murder you all,


In [12]:
for tweet in df["Tweet_Content"][350].to_numpy():
    print(tweet)





I played this interesting quiz on Amazon - Try your luck for a chance to win exciting rewards amazon.in/game/share/g8M…
ve played this interesting quiz on Amazon - Try your luck for a chance to win exciting rewards amazon.in / game / share / g8M...
I played this interesting quiz on Amazon - Try your luck for a chance to win exciting rewards amazon.in / game / share / g8M...
I played this interesting lottery on Amazon - Try your luck earn a card to win exciting rewards amazon.in/game/share/g8M…
I also played this interesting game quiz on Amazon - Try your luck today for finding a chance to help win exciting rewards amazon. in / game / share / and g8M …
I played this interesting rewards game Amazon - so good luck for a day to win exciting rewards amazon.in/game/share/g8M...


Judging from the two cells above, a `Tweet_ID` appears to correspond to a single tweet, with each row recording an edit of that tweet. It is still strange that every tweet has exactly 6 or 7 versions, and the data description at the source of the dataset does not mention any of this.

The edits differ only slightly in words and characters, so they can be used to check whether small edits lead to a different sentiment being predicted. Hence, all the tweets can be used for predicting sentiments, and each row needs to be assigned its own unique ID.

In [13]:
# Sorting first by Tweet_ID
df = df.sort_index()
# Removing Tweet_ID as index and adding it as a proper column
df.reset_index(inplace=True)
df.head(10)

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content
0,1,Amazon,Negative,@amazon wtf .
1,1,Amazon,Negative,@ amazon wtf.
2,1,Amazon,Negative,@ amazon wtf.
3,1,Amazon,Negative,@amazon wtf?
4,1,Amazon,Negative,7 @amazon wtf.
5,1,Amazon,Negative,<unk> wtf.
6,2,Amazon,Negative,I’m really disappointed with amazon today! I o...
7,2,Amazon,Negative,I am really disappointed with Amazon today! I ...
8,2,Amazon,Negative,I'm really disappointed with amazon today! I o...
9,2,Amazon,Negative,I’m extremely disappointed with amazon today! ...


### Analyzing Sentiments 

In [14]:
print(df["Sentiment"].value_counts())
df["Sentiment"].value_counts().plot.pie()

Negative      22808
Positive      21109
Neutral       18603
Irrelevant    13162
Name: Sentiment, dtype: int64


Another interesting thing to check is whether all the edits for the same `Tweet_ID` have the same sentiment. This can be done as follows:

1. Group the records by `Tweet_ID`.
2. Retrieve the `Sentiment` column of each group.
3. Find the unique sentiments in each group. If all the edits share a single sentiment, each group contains exactly one unique sentiment.
4. Apply the `len` function to count the unique sentiments in each group.
5. Flag the groups whose number of unique sentiments is not equal to 1.
6. Step 5 yields True and False values, which can be summed to count the groups that do not have a single unique sentiment.
7. If the result is 0, all the edits of every tweet have the same sentiment.

In [15]:
df.groupby(["Tweet_ID"])["Sentiment"].unique().apply(len).ne(1).sum()


iteritems is deprecated and will be removed in a future version. Use .items instead.



                                                                                

0

Since the result is 0, all edits of a given `Tweet_ID` share the same sentiment.
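As a cross-check, the same test can be sketched with `nunique`, which counts distinct labels per group directly and also makes it easy to list any offending IDs. The toy frame below is hypothetical and uses plain pandas rather than pandas-on-Spark.

```python
import pandas as pd

# Hypothetical data: Tweet_ID 2 deliberately carries two different labels.
toy = pd.DataFrame({
    "Tweet_ID": [1, 1, 2, 2, 3, 3],
    "Sentiment": ["Negative", "Negative", "Positive", "Neutral", "Neutral", "Neutral"],
})

# Distinct sentiment labels per Tweet_ID.
labels_per_id = toy.groupby("Tweet_ID")["Sentiment"].nunique()

# IDs whose edits disagree on the sentiment label.
inconsistent_ids = labels_per_id[labels_per_id != 1].index.tolist()
print(inconsistent_ids)
```

An empty list corresponds to the result of 0 obtained on the real dataset.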

### Analyzing Entities 

The `Entity` column names the entity that the tweet is about. The sentiment provided in the `Sentiment` column, or the one estimated by the lexicon-based approaches in the `sentiment-analysis.ipynb` notebook, can be used to gauge general public opinion on an entity.

This section focusses on analyzing different entities present in the dataset, and the next section analyzes the sentiments for these entities.

In [16]:
counts = df["Entity"].value_counts()

print("Number of Unique Entities",len(counts))
# counts

Number of Unique Entities 32


In [17]:
df["Entity"].value_counts().plot.pie()

The pie chart above shows that the dataset contains a nearly equal number of tweets for each entity.

It can also be observed that the entities are either brands or video games.

### Sentiment Analysis

This section can be taken as an example of how an organization can get a brief overview of public opinion towards them or their products.

In [18]:
counts = df.groupby(["Entity"])["Sentiment"].value_counts().sort_index().to_frame()

counts.columns = ["Count"]
counts

Entity,Sentiment,Count
Amazon,Irrelevant,195
Amazon,Negative,582
Amazon,Neutral,1254
Amazon,Positive,319
ApexLegends,Irrelevant,195
ApexLegends,Negative,606
ApexLegends,Neutral,959
ApexLegends,Positive,652
AssassinsCreed,Irrelevant,265
AssassinsCreed,Negative,382


In [19]:
counts = counts.unstack()["Count"]

# counts.head()

counts.plot.bar()
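Raw counts can be hard to compare across entities whose totals differ slightly. One option, sketched here in plain pandas on made-up rows, is to normalize each entity's counts into shares with `pd.crosstab(..., normalize="index")`.

```python
import pandas as pd

# Hypothetical rows; the real notebook would use the full dataset instead.
toy = pd.DataFrame({
    "Entity": ["Amazon"] * 4 + ["ApexLegends"] * 2,
    "Sentiment": ["Neutral", "Neutral", "Negative", "Positive", "Positive", "Negative"],
})

# Each row of `shares` sums to 1: the sentiment mix within one entity.
shares = pd.crosstab(toy["Entity"], toy["Sentiment"], normalize="index")
print(shares)
```

The resulting frame has one row per entity and one column per sentiment, so it can be passed to a stacked bar chart in the same way as the count table.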

### Understanding Irrelevant Sentiments

Looking at several tweets marked as "Irrelevant" shows that they are not relevant to the entity they are filed under. The question is whether to remove these rows from the sentiment analysis or keep them in the dataset.

In [20]:
df[df["Sentiment"]=="Irrelevant"].head(10)

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content
24,5,Amazon,Irrelevant,"I've purchased 10 times on her site, 2 times o..."
25,5,Amazon,Irrelevant,"I have bought 10 times on their site, 2 times ..."
26,5,Amazon,Irrelevant,"I've purchased 10 times on her site, 2 times o..."
27,5,Amazon,Irrelevant,"I've purchased 10 times into her site, 2 times..."
28,5,Amazon,Irrelevant,"Now I've just purchased 10 times on her site, ..."
29,5,Amazon,Irrelevant,"you've purchased 10 times over her site, 2 tim..."
142,26,Amazon,Irrelevant,Happy pub day @EachStarAWorld! 🥳.
143,26,Amazon,Irrelevant,Happy Pub Day @ EachStarAWorld!.
144,26,Amazon,Irrelevant,Happy pub day @ EachStarAWorld!.
145,26,Amazon,Irrelevant,Happy V day @EachStarAWorld! 🥳.


In [21]:
(df["Sentiment"]=="Irrelevant").sum()/len(df)

0.17391189450595915

Tweets with the `Irrelevant` sentiment make up around 17% of the dataset, which is a sizeable share. These rows can be ignored when checking the accuracy of `Vader Sentiment` and `TextBlob`, but they can still be used to compare the compound and polarity scores estimated by these two tools.
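A minimal sketch of that split, using plain pandas and invented rows: one boolean mask gives both the share of `Irrelevant` rows and the subset kept for accuracy checks. The name `scored` is a hypothetical choice, not something defined elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Irrelevant", "Negative", "Neutral", "Irrelevant"],
    "Tweet_Content": ["love it", "happy pub day", "wtf", "new patch today", "unrelated"],
})

is_irrelevant = toy["Sentiment"] == "Irrelevant"

# Mean of a boolean mask is the fraction of True values.
irrelevant_share = is_irrelevant.mean()

# Rows that would be kept when measuring accuracy against the labels.
scored = toy[~is_irrelevant]
print(f"{irrelevant_share:.0%} irrelevant, {len(scored)} rows kept")
```

Keeping the mask rather than dropping rows in place leaves the `Irrelevant` tweets available for the score comparison.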

----

### Missing Values 

In [22]:
df[df.isna()["Tweet_ID"]]

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content


In [23]:
df[df.isna()["Entity"]]

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content


In [24]:
df[df.isna()["Sentiment"]]

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content


In [25]:
df[df.isna()["Tweet_Content"]].head()

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content
96,16,Amazon,Neutral,
97,16,Amazon,Neutral,
98,16,Amazon,Neutral,
211,37,Amazon,Neutral,
212,37,Amazon,Neutral,


Only the `Tweet_Content` column has missing values.
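The four per-column checks can also be condensed into a single call. The sketch below uses plain pandas on a hypothetical frame; `isna().sum()` returns the number of missing values in every column at once.

```python
import pandas as pd

# Hypothetical frame in which only Tweet_Content has gaps.
toy = pd.DataFrame({
    "Tweet_ID": [16, 16, 37, 37],
    "Entity": ["Amazon"] * 4,
    "Sentiment": ["Neutral"] * 4,
    "Tweet_Content": ["text", None, "more text", None],
})

# One missing-value count per column.
missing_per_column = toy.isna().sum()
print(missing_per_column.to_dict())
```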

In [26]:
df[df["Tweet_ID"]==16]
# Raw Data
# 16,Amazon,Neutral, 

# 16,Amazon,Neutral,It is not the first time that the EU Commission has taken such a step.

# 16,Amazon,Neutral,"At the same time, despite the fact that there are currently some 100 million people living below the poverty line, most of them do not have access to health services and do not have access to health care, while most of them do not have access to health care."

# 16,Amazon,Neutral,

# 16,Amazon,Neutral,

# 16,Amazon,Neutral,

                                                                                

Unnamed: 0,Tweet_ID,Entity,Sentiment,Tweet_Content
93,16,Amazon,Neutral,
94,16,Amazon,Neutral,It is not the first time that the EU Commissio...
95,16,Amazon,Neutral,"At the same time, despite the fact that there ..."
96,16,Amazon,Neutral,
97,16,Amazon,Neutral,
98,16,Amazon,Neutral,


It seems that the missing values are present because some of the edits have no content. These empty edits still share the sentiment label of the edits under the same `Tweet_ID` that do have content.

It is safe to drop these records for sentiment analysis.

In [27]:
missing_count = df.isna()["Tweet_Content"].sum()
print("Records that will be dropped because of missing values: ",missing_count)

print(f"This is {missing_count/len(df):0.2%} of the total data")

Records that will be dropped because of missing values:  686
This is 0.91% of the total data
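When dropping these records, whitespace-only content may also be worth catching. In the output for `Tweet_ID` 16, index 93 shows no visible content yet is not among the rows returned by the `isna()` check on `Tweet_Content`, and its raw line ends in a space, so it probably holds a blank string rather than a missing value. The sketch below, in plain pandas on a hypothetical frame, treats both cases as empty.

```python
import pandas as pd

# Hypothetical frame: one whitespace-only tweet, one real tweet, two missing.
toy = pd.DataFrame({
    "Tweet_ID": [16, 16, 16, 16],
    "Tweet_Content": [" ", "It is not the first time.", None, None],
})

# Missing values become "", then whitespace is stripped, so both
# NaN and blank strings end up flagged as empty content.
is_blank = toy["Tweet_Content"].fillna("").str.strip().eq("")

cleaned = toy[~is_blank]
print(len(toy) - len(cleaned), "rows dropped")
```

Whether blank strings should be dropped is a judgment call. A plain `dropna(subset=["Tweet_Content"])` would remove only the missing values counted above.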


----

### Summary 

The following observations were made from the data analysis of the Twitter sentiment analysis dataset:

- There are 75682 records in the dataset
- There were multiple tweets with the same `Tweet_ID` and it was concluded that they contained edits of the same tweet
- This served as a good opportunity to analyze if the edits are of the same sentiment when predicted using TextBlob and Vader
- There are four sentiment labels in the dataset: Positive, Neutral, Negative, and Irrelevant.
- The Irrelevant label is put on tweets that are not relevant to the corresponding entity
- There are 32 different entities, all of which are brands or video games
- Missing values were only found for `Tweet_Content`. These accounted for 0.91% of the total data and hence can safely be dropped from the dataset