# Sentiment Analysis Notebook

In this notebook we explore the collected Reddit data. The data includes posts and the comments that belong to them. Comments can have a threaded structure.

In [None]:
import numpy
import pandas as pd
import plotly.express as px

### Importing Data

In [None]:
data_path = "../data/"
posts_df = pd.read_csv(data_path + "posts_sentiment.csv")
comments_df = pd.read_csv(data_path + "comments_sentiment.csv")

## Taking a glimpse into the data

Let's look at how much data we have for each subreddit.

In [None]:
posts_count = posts_df.groupby("subreddit").size().reset_index(name="post_count")
comments_count = comments_df.groupby("subreddit").size().reset_index(name="comment_count")
merged_df = pd.merge(posts_count, comments_count, on="subreddit", how="outer")
merged_df.fillna(0, inplace=True)

merged_df

### Checking for null values

In [84]:
posts_df.isnull().sum()

_id                0
author             0
permalink          0
post_id            0
sentiment.label    0
sentiment.score    0
subreddit          0
title              0
dtype: int64

In [85]:
comments_df.isnull().sum()

_id                   0
author                0
parent_id          7897
post_id               0
sentiment.label       1
sentiment.score       1
subreddit             0
text                233
thing_id              0
upvotes               0
dtype: int64

- Some comments have no text (233 rows).
- One comment has no sentiment label or score.

Let's look into those points.

## Missing-value handling and further exploration

Taking a look into authors of empty comments:

In [86]:
comments_df[comments_df["text"].isnull()]["author"].value_counts()

author
AutoModerator          149
SquatCorgiLegs           2
todang                   2
mohiben                  2
han_bylo                 2
                      ... 
WallStreetDoesntBet      1
howiecash                1
Sir-Kevly                1
Overweighover            1
Uchihagod53              1
Name: count, Length: 81, dtype: int64

The number of empty comments by the user "AutoModerator", which is a bot, is fairly high (149 of the 233 empty comments). We primarily want to study user-generated content, so we drop all comments by "AutoModerator". No other user accounts for a disproportionate number of empty comments.

In [87]:
comments_df = comments_df[comments_df["author"] != "AutoModerator"]

In [88]:
comments_df[comments_df["text"].isnull()]["subreddit"].value_counts()

subreddit
nba                     11
wallstreetbets          10
shitposting              9
memes                    8
Unexpected               8
pcmasterrace             6
mildlyinfuriating        6
pics                     5
WhitePeopleTwitter       4
gaming                   3
worldnews                3
LivestreamFail           3
HonkaiStarRail           2
news                     2
leagueoflegends          2
Damnthatsinteresting     1
therewasanattempt        1
Name: count, dtype: int64

In [89]:
# Get number of isnull comments per subreddit
isnull_comments = comments_df[comments_df["text"].isnull()]["subreddit"].value_counts()

all_comments = comments_df["subreddit"].value_counts()

isnull_comments = isnull_comments.reindex(all_comments.index, fill_value=0)

# Plot percentage of isnull_comments per subreddit
fig = px.bar(x=isnull_comments.index, y=(isnull_comments.values/all_comments.values)*100,
             title="Percentage of empty comments per subreddit",
             labels={"x": "Subreddit", "y": "Percentage (%) of isnull comments"})
fig.show()
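
As a cross-check for the plotted values, the same percentage can be computed in a single expression by averaging a boolean mask per group. The following is a minimal sketch on made-up toy data; `toy_comments` and its values are assumptions for illustration and not part of the real dataset.

```python
import pandas as pd

# Toy stand-in for comments_df (hypothetical values, not the real data)
toy_comments = pd.DataFrame({
    "subreddit": ["nba", "nba", "memes", "memes", "memes", "pics"],
    "text": ["nice game", None, None, "lol", "haha", "great shot"],
})

# The mean of a boolean mask per group is the share of null texts;
# multiplying by 100 turns it into a percentage.
null_pct = (
    toy_comments["text"].isnull()
    .groupby(toy_comments["subreddit"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(null_pct)
```

Because the mask and the grouping key share the same index, no separate `reindex` step is needed to align subreddits that have no empty comments.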

Now let's take a closer look at why some of these comments could be empty.

In [90]:
# Take random seeded sample of 3 isnull comments
comments_df[comments_df["text"].isnull()].sample(n=3, random_state=1337)

Unnamed: 0,_id,author,parent_id,post_id,sentiment.label,sentiment.score,subreddit,text,thing_id,upvotes
4376,645e9b48c22e1f3b9b20a1f7,JakeWilling,,13fs28d,negative,0.3473,LivestreamFail,,t1_jjwik8n,-15
9819,645ea5b2c22e1f3b9b20ba46,FPS-owner97,t1_jjujhb5,13fedlp,negative,0.3473,memes,,t1_jjujvsr,79
3808,645e9a3dc22e1f3b9b209f6f,CoffeeDreamsLite,,13f0qqa,negative,0.3473,mildlyinfuriating,,t1_jjtmlhe,20


Now, let's look at those three random samples:

#### Case study 1 - Hidden due to downvotes on r/LivestreamFail
The first comment seems to have been **hidden by Reddit** because it received a high number of downvotes.

<div style="text-align: center;">
  <img src="./assets/r_lsf_hidden_comment.png" alt="Hidden Comment on r/LiveStreamFail" style="width: 10%">
</div>

If we expand the comment manually while logged in, we can see its text. We were not able to avoid this in the scraping process, because Reddit does not allow anonymous users to expand hidden comments.

#### Case study 2 - GIFs as comments on r/memes
The second comment contains only a **GIF**. This is not something we want to analyze, as it does not contain any text.

<div style="text-align: center;">
  <img src="./assets/r_memes_giphy.png" alt="GIF Comment on r/memes" style="width: 10%">
</div>

The scraper itself would also not be able to extract text from GIFs, so we will drop all comments that only contain GIFs.

#### Case study 3 - GIFs as comments on r/mildlyinfuriating
The third comment seems to have the same issue as the comment in the second case study:

<div style="text-align: center;">
  <img src="./assets/r_mildlyinfuriating_giphy.png" alt="GIF Comment on r/mildlyinfuriating" style="width: 10%">
</div>

#### Conclusion
To conclude this section, we drop the comments without any text, since without text our sentiment analysis step has nothing meaningful to classify.

In [91]:
# Drop all comments without text
comments_df = comments_df[comments_df["text"].notnull()]
comments_df.isnull().sum()

_id                   0
author                0
parent_id          7683
post_id               0
sentiment.label       1
sentiment.score       1
subreddit             0
text                  0
thing_id              0
upvotes               0
dtype: int64

#### Finding further bot-generated content

During null-value handling we already removed the comments that were auto-generated by the bot "AutoModerator". Now let's try to find further bot-generated comments.

In [92]:
# Find all comments in which the author's name seems to point towards being a bot
bot_comments = comments_df[comments_df["author"].str.contains("bot", regex=False, case=False) & ~comments_df["author"].str.contains("robot|bottom|robbot|bottle|both|bott|rambot", regex=True, case=False)]["author"].value_counts()
bot_comments

author
Judgement_Bot_AITA     47
LSFBotUtilities        25
unexBot                20
livestreamfailsbot     10
justanotherbot123       1
bot_goodbot_bot         1
AmputatorBot            1
RUS_BOT_tokyo           1
PCMRBot                 1
BotElMago               1
forrealnoRussianbot     1
RepostSleuthBot         1
Name: count, dtype: int64

We cannot verify that these accounts are actually bots, but together they account for only 110 comments, so dropping them should not change the results much.
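
To back up a claim like this with a number, one could compute the share of comments written by the flagged authors before dropping them. This is a minimal sketch on toy data; `toy_comments` and `suspected_bots` are hypothetical names and values used only for illustration.

```python
import pandas as pd

# Toy stand-in for comments_df (hypothetical authors and texts)
toy_comments = pd.DataFrame({
    "author": ["alice", "unexBot", "bob", "carol", "PCMRBot", "dave", "erin", "alice"],
    "text": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

# Hypothetical list of suspected bot accounts (in the notebook: bot_comments.index)
suspected_bots = ["unexBot", "PCMRBot"]

# Boolean mask of comments written by a suspected bot
is_flagged = toy_comments["author"].isin(suspected_bots)

flagged_count = int(is_flagged.sum())
flagged_share = is_flagged.mean() * 100
print(f"{flagged_count} of {len(toy_comments)} comments flagged ({flagged_share:.1f}%)")
```

If the share turned out to be large for a single subreddit, it might be worth inspecting those accounts manually before dropping them.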

In [93]:
# Drop all comments by authors in the list of bot_comments
comments_df = comments_df[~comments_df["author"].isin(bot_comments.index)]

#### Looking at the comment(s) without an assigned sentiment

In [94]:
# Find comments where sentiment.label and sentiment.score are null and only show the columns _id, text, sentiment.label and sentiment.score
comments_df[comments_df["sentiment.label"].isnull()][["_id", "text", "sentiment.label", "sentiment.score"]]

Unnamed: 0,_id,text,sentiment.label,sentiment.score
9668,645ea564c22e1f3b9b20b99b,:) :) :) :) :) :) :) :) :) :) :) :) :) :) :) :...,,


This user posted a comment consisting of a row of smiley faces, made up of colons and closing parentheses. Our language model could not assign a sentiment to it, so we will drop this comment.
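
Comments like this one could also be located by their content rather than by a hard-coded `_id`. The sketch below flags comments that contain no letters at all. It runs on toy data, and the "no letters" rule is an assumption chosen for illustration, not the criterion used in this notebook.

```python
import pandas as pd

# Toy stand-in for comments_df (hypothetical ids and texts)
toy_comments = pd.DataFrame({
    "_id": ["c1", "c2", "c3", "c4"],
    "text": [":) :) :)", "great post!", "123 !!!", "ok"],
})

# [^\W\d_] matches a word character that is neither a digit nor an underscore,
# i.e. a letter. na=False treats missing texts as "no letters found".
has_letters = toy_comments["text"].str.contains(r"[^\W\d_]", regex=True, na=False)

# Candidates for manual review: comments without a single letter
no_letter_comments = toy_comments[~has_letters]
print(no_letter_comments)
```

Such a rule only yields candidates for review. Emoticon-only or number-only comments can still carry sentiment, so whether to drop them remains a judgment call.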

In [95]:
comments_df = comments_df[comments_df["_id"] != "645ea564c22e1f3b9b20b99b"]

# Analyzing the sentiments

In [101]:
fig = px.box(comments_df, x="sentiment.label", y="sentiment.score", title="Sentiment score distribution per label of Reddit comments", notched=True)

fig.update_layout(
    xaxis_title="Sentiment Label",
    yaxis_title="Score"
)

fig.show()

The boxplot shows that the scores for the negative and positive labels are distributed fairly similarly. For the neutral label the scores appear lower, which suggests the model is less certain when it labels a comment as neutral than when it assigns one of the other two labels.
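
A reading of the boxplot can be checked against per-label summary statistics. The following is a minimal sketch on toy data. The values are made up, and the label names `positive`, `neutral` and `negative` are assumed from the discussion above.

```python
import pandas as pd

# Toy stand-in for comments_df (hypothetical scores)
toy_comments = pd.DataFrame({
    "sentiment.label": ["positive", "positive", "positive",
                        "neutral", "neutral",
                        "negative", "negative"],
    "sentiment.score": [0.9, 0.8, 0.7, 0.5, 0.6, 0.85, 0.95],
})

# Count, median and range of the score for each sentiment label
score_summary = (
    toy_comments
    .groupby("sentiment.label")["sentiment.score"]
    .agg(["count", "median", "min", "max"])
)
print(score_summary)
```

Comparing the medians and ranges per label gives a numeric counterpart to what the boxes show visually.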

## Hypothesis 1 - The sentiment of a post is correlated to the sentiment of its comments

In [97]:
# For every post, average the sentiment score of its comments separately for each sentiment.label
post_sentiment = comments_df.groupby(["post_id", "sentiment.label"]).agg({"sentiment.score": "mean"}).reset_index()

# Pivot the dataframe to have the sentiment.label as columns
post_sentiment_pivot = post_sentiment.pivot(index="post_id", columns="sentiment.label", values="sentiment.score").reset_index()

# Add the columns of the pivot table to the posts_df
posts_df = pd.merge(posts_df, post_sentiment_pivot, on="post_id", how="left")
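
One possible way to test the hypothesis is to collapse label and score into a single signed value and correlate the post value with the mean value of its comments. The sketch below is an assumption about how this could be done, not the method used in this notebook. The toy data, the `SIGN` mapping, and the column names `post_signed`, `comment_signed` and `mean_comment_signed` are all hypothetical.

```python
import pandas as pd

# Assumed mapping from label to sign; the label names are assumptions
SIGN = {"negative": -1, "neutral": 0, "positive": 1}

# Toy stand-ins for posts_df and comments_df (hypothetical values)
toy_posts = pd.DataFrame({
    "post_id": ["a", "b", "c", "d"],
    "sentiment.label": ["positive", "negative", "neutral", "positive"],
    "sentiment.score": [0.9, 0.8, 0.6, 0.7],
})
toy_comments = pd.DataFrame({
    "post_id": ["a", "a", "b", "b", "c", "d"],
    "sentiment.label": ["positive", "neutral", "negative", "negative", "positive", "positive"],
    "sentiment.score": [0.8, 0.5, 0.9, 0.7, 0.6, 0.9],
})

def signed_score(df):
    """Combine label and score into one value in [-1, 1]."""
    return df["sentiment.label"].map(SIGN) * df["sentiment.score"]

toy_posts["post_signed"] = signed_score(toy_posts)
toy_comments["comment_signed"] = signed_score(toy_comments)

# Average signed comment score per post
mean_comment = (
    toy_comments.groupby("post_id")["comment_signed"]
    .mean()
    .rename("mean_comment_signed")
    .reset_index()
)

# Keep only posts that have at least one comment, then correlate
merged = toy_posts.merge(mean_comment, on="post_id", how="inner")
r = merged["post_signed"].corr(merged["mean_comment_signed"])  # Pearson by default
print(f"Pearson r = {r:.3f}")
```

Mapping neutral to zero discards the neutral score entirely, which is a simplification. Keeping the per-label columns from the pivot table above would preserve more detail.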