# Sentiment Analysis on Reddit Posts and Comments

In this notebook we explore the collected Reddit data. The data consists of posts and the comments that belong to them. Comments can form threaded structures.

In [144]:
import numpy as np
import pandas as pd
import plotly.express as px

### Importing Data

In [145]:
data_path = "../data/"
posts_df = pd.read_csv(data_path + "posts_sentiment.csv")
comments_df = pd.read_csv(data_path + "comments_sentiment.csv")

## Taking a glimpse into the data

Let's look at how much data we have for each subreddit.

In [146]:
posts_count = posts_df.groupby("subreddit").size().reset_index(name="post_count")
comments_count = comments_df.groupby("subreddit").size().reset_index(name="comment_count")
merged_df = pd.merge(posts_count, comments_count, on="subreddit", how="outer")
merged_df.fillna(0, inplace=True)

merged_df

| | subreddit | post_count | comment_count |
|---|---|---|---|
| 0 | AmItheAsshole | 50 | 806 |
| 1 | AskReddit | 50 | 877 |
| 2 | ChatGPT | 31 | 407 |
| 3 | Damnthatsinteresting | 25 | 403 |
| 4 | Home | 26 | 295 |
| 5 | HonkaiStarRail | 27 | 361 |
| 6 | LivestreamFail | 28 | 334 |
| 7 | Unexpected | 23 | 328 |
| 8 | WhitePeopleTwitter | 27 | 465 |
| 9 | facepalm | 25 | 388 |


For the subreddit "worldnewsvideo" we only have one observation in the posts dataset. We will drop this subreddit as it does not provide enough data to be analyzed.

In [147]:
posts_df = posts_df[posts_df["subreddit"] != "worldnewsvideo"]
comments_df = comments_df[comments_df["subreddit"] != "worldnewsvideo"]
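The same filter can be written against a threshold instead of a hard-coded subreddit name. This is a minimal sketch on made-up toy data; `toy_posts` and `MIN_POSTS` are hypothetical names and the threshold is not a value used in this notebook.

```python
import pandas as pd

# Toy stand-in for posts_df; the rows are invented for illustration.
toy_posts = pd.DataFrame({
    "subreddit": ["Home", "Home", "Home", "worldnewsvideo", "ChatGPT", "ChatGPT"],
    "post_id": ["a", "b", "c", "d", "e", "f"],
})

MIN_POSTS = 2  # hypothetical threshold

# Count posts per subreddit and keep only subreddits that reach the threshold.
counts = toy_posts["subreddit"].value_counts()
keep = counts[counts >= MIN_POSTS].index
filtered_posts = toy_posts[toy_posts["subreddit"].isin(keep)]

print(sorted(filtered_posts["subreddit"].unique()))
```

The index `keep` could be reused to filter the comments with the same `isin` call, so both datasets stay consistent.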

### Checking for null values

In [148]:
posts_df.isnull().sum()

_id                0
author             0
permalink          0
post_id            0
sentiment.label    0
sentiment.score    0
subreddit          0
title              0
dtype: int64

In [149]:
comments_df.isnull().sum()

_id                   0
author                0
parent_id          7896
post_id               0
sentiment.label       1
sentiment.score       1
subreddit             0
text                232
thing_id              0
upvotes               0
dtype: int64

- Some comments have no text.
- One comment has no sentiment label and no sentiment score.

Let's look into those points.
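The many missing `parent_id` values are not treated as a problem here. One plausible reading, which is an assumption and not something the scraper output confirms, is that a comment without a `parent_id` replies directly to the post, while a filled `parent_id` points at the `thing_id` of another comment. Under that assumption, the depth of a comment in its thread could be sketched as follows, using toy data and a hypothetical helper `depth`:

```python
import pandas as pd

# Toy stand-in for comments_df with the two columns that encode the threading.
toy_comments = pd.DataFrame({
    "thing_id": ["t1_a", "t1_b", "t1_c", "t1_d"],
    "parent_id": [None, "t1_a", "t1_b", None],
})

# Map every comment to its parent (missing for assumed top-level comments).
parent_of = dict(zip(toy_comments["thing_id"], toy_comments["parent_id"]))

def depth(thing_id):
    """Number of ancestor comments; 0 for an assumed top-level comment."""
    d = 0
    parent = parent_of.get(thing_id)
    while pd.notna(parent):
        d += 1
        # A parent that is not in the data ends the walk (get returns None).
        parent = parent_of.get(parent)
    return d

toy_comments["depth"] = toy_comments["thing_id"].map(depth)
print(toy_comments)
```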

## Missing-value handling and further exploration

Taking a look into authors of empty comments:

In [150]:
comments_df[comments_df["text"].isnull()]["author"].value_counts()

author
AutoModerator          148
SquatCorgiLegs           2
todang                   2
mohiben                  2
han_bylo                 2
                      ... 
WallStreetDoesntBet      1
howiecash                1
Sir-Kevly                1
Overweighover            1
Uchihagod53              1
Name: count, Length: 81, dtype: int64

The number of empty comments by the user "AutoModerator", which is a bot, is fairly high. We primarily want to study user-generated content, so we will drop all comments by "AutoModerator". Apart from that, no other user accounts for a disproportionate share of the empty comments.

In [151]:
comments_df = comments_df[comments_df["author"] != "AutoModerator"]

In [152]:
comments_df[comments_df["text"].isnull()]["subreddit"].value_counts()

subreddit
nba                     11
wallstreetbets          10
shitposting              9
memes                    8
Unexpected               8
pcmasterrace             6
mildlyinfuriating        6
pics                     5
WhitePeopleTwitter       4
gaming                   3
worldnews                3
LivestreamFail           3
HonkaiStarRail           2
news                     2
leagueoflegends          2
Damnthatsinteresting     1
therewasanattempt        1
Name: count, dtype: int64

In [153]:
# Get number of isnull comments per subreddit
isnull_comments = comments_df[comments_df["text"].isnull()]["subreddit"].value_counts()

all_comments = comments_df["subreddit"].value_counts()

isnull_comments = isnull_comments.reindex(all_comments.index, fill_value=0)

# Plot percentage of isnull_comments per subreddit
fig = px.bar(x=isnull_comments.index, y=(isnull_comments.values/all_comments.values)*100,
             title="Percentage of empty comments per subreddit",
             labels={"x": "Subreddit", "y": "Percentage (%) of isnull comments"})
fig.show()
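The share of empty comments per subreddit can also be computed in one step by averaging the boolean null mask within each subreddit. This is a sketch on invented toy data (`toy_comments` is a hypothetical stand-in for `comments_df`); it avoids aligning two separate count series.

```python
import pandas as pd

# Toy stand-in for comments_df; None marks a comment without text.
toy_comments = pd.DataFrame({
    "subreddit": ["nba", "nba", "memes", "memes", "memes", "Home"],
    "text": ["nice", None, "lol", None, None, "ok"],
})

# The mean of a boolean mask is the fraction of True values per group.
empty_share = (
    toy_comments["text"].isnull()
    .groupby(toy_comments["subreddit"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)

print(empty_share)
```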

Now let's take a closer look at why some of these comments could be empty.

In [154]:
# Take random seeded sample of 3 isnull comments
comments_df[comments_df["text"].isnull()].sample(n=3, random_state=1337)

| | _id | author | parent_id | post_id | sentiment.label | sentiment.score | subreddit | text | thing_id | upvotes |
|---|---|---|---|---|---|---|---|---|---|---|
| 4376 | 645e9b48c22e1f3b9b20a1f7 | JakeWilling | NaN | 13fs28d | negative | 0.3473 | LivestreamFail | NaN | t1_jjwik8n | -15 |
| 9819 | 645ea5b2c22e1f3b9b20ba46 | FPS-owner97 | t1_jjujhb5 | 13fedlp | negative | 0.3473 | memes | NaN | t1_jjujvsr | 79 |
| 3808 | 645e9a3dc22e1f3b9b209f6f | CoffeeDreamsLite | NaN | 13f0qqa | negative | 0.3473 | mildlyinfuriating | NaN | t1_jjtmlhe | 20 |


Now, let's look at those three random samples:

#### Case study 1 - Hidden due to downvotes on r/LivestreamFail
The first comment seems to have been **hidden by Reddit** because it received a large number of downvotes.

<div style="text-align: center;">
  <img src="./assets/r_lsf_hidden_comment.png" alt="Hidden Comment on r/LiveStreamFail" style="width: 10%">
</div>

If we expand the comment manually while logged in, we can see its text. We were not able to avoid this during scraping, because Reddit does not allow anonymous users to expand hidden comments; one needs to be logged in to use that functionality.

#### Case study 2 - GIFs as comments on r/memes
The second comment seems to contain only a **GIF**. This is something we would not want to analyze anyway, as it does not contain any text.

<div style="text-align: center;">
  <img src="./assets/r_memes_giphy.png" alt="GIF Comment on r/memes" style="width: 10%">
</div>

The scraper cannot extract text from GIFs either, so we will drop all comments that only contain GIFs.

#### Case study 3 - GIFs as comments on r/mildlyinfuriating
The third comment seems to have the same issue as the comment in the second case study:

<div style="text-align: center;">
  <img src="./assets/r_mildlyinfuriating_giphy.png" alt="GIF Comment on r/mildlyinfuriating" style="width: 10%">
</div>

#### Conclusion
To conclude this section, we will drop the comments without any text, as our sentiment analysis step could not classify any sentiment for these comments.

In [155]:
# Drop all comments without text
comments_df = comments_df[comments_df["text"].notnull()]
comments_df.isnull().sum()

_id                   0
author                0
parent_id          7683
post_id               0
sentiment.label       1
sentiment.score       1
subreddit             0
text                  0
thing_id              0
upvotes               0
dtype: int64

#### Finding further bot-generated content

While handling the null values we already removed the comments that were auto-generated by the bot "AutoModerator". Now let's try to find further bot-generated comments.

In [156]:
# Find all comments whose author name contains "bot",
# excluding names that contain common non-bot words such as "robot" or "bottle"
author = comments_df["author"]
has_bot = author.str.contains("bot", regex=False, case=False)
is_excluded = author.str.contains("robot|bottom|robbot|bottle|both|bott|rambot", regex=True, case=False)
bot_comments = comments_df[has_bot & ~is_excluded]["author"].value_counts()
bot_comments

author
Judgement_Bot_AITA     47
LSFBotUtilities        25
unexBot                20
livestreamfailsbot     10
justanotherbot123       1
bot_goodbot_bot         1
AmputatorBot            1
RUS_BOT_tokyo           1
PCMRBot                 1
BotElMago               1
forrealnoRussianbot     1
RepostSleuthBot         1
Name: count, dtype: int64

We cannot verify that these accounts are actually bots, but they account for only a small number of comments, so dropping them will not change the overall picture much.

In [157]:
# Drop all comments by authors in the list of bot_comments
comments_df = comments_df[~comments_df["author"].isin(bot_comments.index)]
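To see how this name-based heuristic behaves on individual names, the same include/exclude logic can be written as a plain function. This is a sketch only; `looks_like_bot` is a hypothetical helper, not part of the notebook, and the example names other than those listed above are invented.

```python
import re

# Same exclusion words as in the filter above, matched case-insensitively.
EXCLUDE = re.compile(r"robot|bottom|robbot|bottle|both|bott|rambot", re.IGNORECASE)

def looks_like_bot(author: str) -> bool:
    """True if the name contains 'bot' and none of the excluded words."""
    return "bot" in author.lower() and EXCLUDE.search(author) is None

for name in ["RepostSleuthBot", "BotElMago", "MrRobot42", "bottle_cap", "alice"]:
    print(name, looks_like_bot(name))
```

A name such as "BotElMago" passes the check although it may well belong to a human, which illustrates why the result is a heuristic and not a verified bot list.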

#### Looking at the comment(s) without an assigned sentiment

In [158]:
# Find comments in which sentiment.label (and therefore sentiment.score) is null and only show the columns _id, text, sentiment.label and sentiment.score
comments_df[comments_df["sentiment.label"].isnull()][["_id", "text", "sentiment.label", "sentiment.score"]]

| | _id | text | sentiment.label | sentiment.score |
|---|---|---|---|---|
| 9668 | 645ea564c22e1f3b9b20b99b | :) :) :) :) :) :) :) :) :) :) :) :) :) :) :) :... | NaN | NaN |


This user posted a comment consisting of a series of smiley faces made up of colons and closing parentheses. Our language model could not assign a sentiment to it, so we will drop this comment.

In [159]:
comments_df = comments_df[comments_df["_id"] != "645ea564c22e1f3b9b20b99b"]

# Analyzing the sentiments

In [160]:
fig = px.box(comments_df, x="sentiment.label", y="sentiment.score", title="Sentiment score distribution per label of Reddit comments", notched=True)

fig.update_layout(
    xaxis_title="Sentiment Label",
    yaxis_title="Score"
)

fig.show()

The boxplot shows that the scores for the negative and the positive label are distributed fairly similarly. For the neutral label the model seems to be less certain: comments labeled "neutral" are assigned that label with less confidence than comments labeled negative or positive.
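The visual impression from the boxplot can be backed by summary statistics per label. This is a minimal sketch on invented toy scores; `toy_comments` is a hypothetical stand-in for `comments_df`, and the numbers are not results from the real data.

```python
import pandas as pd

# Toy stand-in for comments_df; the scores are invented.
toy_comments = pd.DataFrame({
    "sentiment.label": ["negative", "negative", "neutral", "neutral", "positive", "positive"],
    "sentiment.score": [0.90, 0.70, 0.50, 0.60, 0.80, 0.95],
})

# Count, mean, standard deviation and quartiles of the score for every label.
summary = toy_comments.groupby("sentiment.label")["sentiment.score"].describe()

print(summary[["count", "25%", "50%", "75%"]])
```

The `50%` column is the median that the boxplot draws as the center line of each box.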

## Hypothesis: The sentiment of a post is correlated to the sentiment of its comments

In [161]:
# For every post, average the sentiment score of its comments separately for each sentiment.label
post_sentiment = comments_df.groupby(["post_id", "sentiment.label"]).agg({"sentiment.score": "mean"}).reset_index()

# Pivot the dataframe to have the sentiment.label as columns
post_sentiment_pivot = post_sentiment.pivot(index="post_id", columns="sentiment.label", values="sentiment.score").reset_index()

# Add the columns of the pivot table to the posts_df
posts_df = pd.merge(posts_df, post_sentiment_pivot, on="post_id", how="left")

In [162]:
posts_df

| | _id | author | permalink | post_id | sentiment.label | sentiment.score | subreddit | title | negative | neutral | positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 645e93a2c22e1f3b9b208ebf | leastlyharmful | https://www.reddit.com/r/Home/comments/13flrib... | 13flrib | negative | 0.7577 | Home | crappy_tile_cut_around_baseboard_heater_from | 0.752629 | 0.514022 | 0.62715 |
| 1 | 645e93aac22e1f3b9b208ed4 | Bamieclif | https://www.reddit.com/r/Home/comments/13fmwor... | 13fmwor | neutral | 0.4997 | Home | what_is_this_and_how_do_i_replace_it_ceiling_l... | 0.800150 | 0.556137 | 0.79592 |
| 2 | 645e93aec22e1f3b9b208edd | UnstableProgram | https://www.reddit.com/r/Home/comments/13fqsh3... | 13fqsh3 | negative | 0.7246 | Home | weird_drywall_construction | NaN | 0.631029 | NaN |
| 3 | 645e93b0c22e1f3b9b208ee1 | queef_quencher | https://www.reddit.com/r/Home/comments/13fqodc... | 13fqodc | neutral | 0.3997 | Home | best_way_to_weather_seal_bottom_of_door | 0.547200 | 0.404400 | NaN |
| 4 | 645e93b7c22e1f3b9b208ef6 | TerenceMulvaney | https://www.reddit.com/r/Home/comments/13eozd7... | 13eozd7 | neutral | 0.6065 | Home | clearance_is_clearance | 0.614775 | 0.624680 | 0.63400 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | 645ea79dc22e1f3b9b20bf20 | itsfirePVP | https://www.reddit.com/r/ChatGPT/comments/13fe... | 13fe4p5 | positive | 0.4451 | ChatGPT | thank_you_chatgpt | 0.807913 | 0.611440 | 0.74045 |
| 746 | 645ea79fc22e1f3b9b20bf24 | jed3c | https://www.reddit.com/r/ChatGPT/comments/13fr... | 13frslu | neutral | 0.5519 | ChatGPT | the_apology_rate_off_the_charts | NaN | 0.654000 | NaN |
| 747 | 645ea7a2c22e1f3b9b20bf28 | kk126 | https://www.reddit.com/r/ChatGPT/comments/13ft... | 13fto4n | neutral | 0.5052 | ChatGPT | is_this_the_new_as_an_ai_language_model | NaN | 0.811400 | NaN |
| 748 | 645ea7a4c22e1f3b9b20bf2c | kk126 | https://www.reddit.com/r/ChatGPT/comments/13ft... | 13fto4n | neutral | 0.5052 | ChatGPT | is_this_the_new_as_an_ai_language_model | NaN | 0.811400 | NaN |


In [163]:
# Add an "avg_label" and an "avg_score" attribute to posts_df: "avg_label" is the label with the highest mean score among the post's comments, and "avg_score" is that highest mean score
posts_df["avg_label"] = posts_df[["negative", "neutral", "positive"]].idxmax(axis=1)
posts_df["avg_score"] = posts_df[["negative", "neutral", "positive"]].max(axis=1)
posts_df

| | _id | author | permalink | post_id | sentiment.label | sentiment.score | subreddit | title | negative | neutral | positive | avg_label | avg_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 645e93a2c22e1f3b9b208ebf | leastlyharmful | https://www.reddit.com/r/Home/comments/13flrib... | 13flrib | negative | 0.7577 | Home | crappy_tile_cut_around_baseboard_heater_from | 0.752629 | 0.514022 | 0.62715 | negative | 0.752629 |
| 1 | 645e93aac22e1f3b9b208ed4 | Bamieclif | https://www.reddit.com/r/Home/comments/13fmwor... | 13fmwor | neutral | 0.4997 | Home | what_is_this_and_how_do_i_replace_it_ceiling_l... | 0.800150 | 0.556137 | 0.79592 | negative | 0.800150 |
| 2 | 645e93aec22e1f3b9b208edd | UnstableProgram | https://www.reddit.com/r/Home/comments/13fqsh3... | 13fqsh3 | negative | 0.7246 | Home | weird_drywall_construction | NaN | 0.631029 | NaN | neutral | 0.631029 |
| 3 | 645e93b0c22e1f3b9b208ee1 | queef_quencher | https://www.reddit.com/r/Home/comments/13fqodc... | 13fqodc | neutral | 0.3997 | Home | best_way_to_weather_seal_bottom_of_door | 0.547200 | 0.404400 | NaN | negative | 0.547200 |
| 4 | 645e93b7c22e1f3b9b208ef6 | TerenceMulvaney | https://www.reddit.com/r/Home/comments/13eozd7... | 13eozd7 | neutral | 0.6065 | Home | clearance_is_clearance | 0.614775 | 0.624680 | 0.63400 | positive | 0.634000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | 645ea79dc22e1f3b9b20bf20 | itsfirePVP | https://www.reddit.com/r/ChatGPT/comments/13fe... | 13fe4p5 | positive | 0.4451 | ChatGPT | thank_you_chatgpt | 0.807913 | 0.611440 | 0.74045 | negative | 0.807913 |
| 746 | 645ea79fc22e1f3b9b20bf24 | jed3c | https://www.reddit.com/r/ChatGPT/comments/13fr... | 13frslu | neutral | 0.5519 | ChatGPT | the_apology_rate_off_the_charts | NaN | 0.654000 | NaN | neutral | 0.654000 |
| 747 | 645ea7a2c22e1f3b9b20bf28 | kk126 | https://www.reddit.com/r/ChatGPT/comments/13ft... | 13fto4n | neutral | 0.5052 | ChatGPT | is_this_the_new_as_an_ai_language_model | NaN | 0.811400 | NaN | neutral | 0.811400 |
| 748 | 645ea7a4c22e1f3b9b20bf2c | kk126 | https://www.reddit.com/r/ChatGPT/comments/13ft... | 13fto4n | neutral | 0.5052 | ChatGPT | is_this_the_new_as_an_ai_language_model | NaN | 0.811400 | NaN | neutral | 0.811400 |
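Note that taking `idxmax` over the per-label mean scores picks the label with the highest average confidence, which is not necessarily the label that occurs most often among a post's comments. The sketch below contrasts the two on invented toy data; `toy_comments`, `top_mean_label` and `most_common_label` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for comments_df: p1 has two low-confidence neutral comments
# and one high-confidence negative comment.
toy_comments = pd.DataFrame({
    "post_id": ["p1", "p1", "p1", "p2", "p2"],
    "sentiment.label": ["neutral", "neutral", "negative", "positive", "positive"],
    "sentiment.score": [0.40, 0.45, 0.95, 0.70, 0.80],
})

# Label with the highest mean score per post (the idxmax-over-means approach).
mean_scores = toy_comments.pivot_table(
    index="post_id", columns="sentiment.label",
    values="sentiment.score", aggfunc="mean",
)
top_mean_label = mean_scores.idxmax(axis=1)

# Label that occurs most often among the comments of each post.
label_counts = pd.crosstab(toy_comments["post_id"], toy_comments["sentiment.label"])
most_common_label = label_counts.idxmax(axis=1)

print(pd.DataFrame({"top_mean_label": top_mean_label,
                    "most_common_label": most_common_label}))
```

For post `p1` the two definitions disagree: the single confident negative comment wins on mean score, while neutral is the most frequent label.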


In [164]:
# Scatterplot of the avg_score vs sentiment.score, colored in by avg_label and facetted by the sentiment.label
fig = px.scatter(posts_df, x="avg_score", y="sentiment.score", color="avg_label", facet_col="sentiment.label",
                 title="Average sentiment score of comments vs sentiment score of post",
                 labels={"avg_label": "Label with highest mean score among comments"})

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]+ " posts"))

# Change the axis titles for each facet
fig.for_each_xaxis(lambda a: a.update(title_text="Average Score of Comments"))
fig.for_each_yaxis(lambda a: a.update(title_text="Score of Post"))

fig.show()

As we can see, the scatter plots do not show a clear structure or correlation between the sentiment score of a post and the average sentiment score of its comment section.

In [165]:
# Calculate the correlation matrix between the post score and the mean comment score per label
correlation_matrix = posts_df[["sentiment.score", "negative", "neutral", "positive"]].corr()

fig = px.imshow(correlation_matrix,
                x=correlation_matrix.columns,
                y=correlation_matrix.columns,
                color_continuous_scale='RdBu',
                title='Correlation between sentiment of post and its comment section')

fig.update_layout(width=500, height=500)

annotations = []
for i, row in enumerate(correlation_matrix.values):
    for j, value in enumerate(row):
        annotations.append(dict(
            x=j,
            y=i,
            text=f"{value:.2f}",
            showarrow=False,
            font=dict(color="white")
        ))

fig.update_layout(annotations=annotations)

fig.show()

## Which Subreddit is the most positive/negative?

In [166]:
# Group the posts by subreddit and aggregate the sentiment.score by its mean
subreddit_sentiment = posts_df.groupby(["subreddit", "sentiment.label"]).agg({"sentiment.score": "mean"}).reset_index()

# Plot all subreddits in a bar chart, facetted by sentiment.label
fig = px.bar(subreddit_sentiment, x="subreddit", y="sentiment.score", color="sentiment.label", facet_col="sentiment.label",
             title="Average sentiment score of posts per subreddit",
             labels={"subreddit": "", "sentiment.score": "Average Score", "sentiment.label": "Sentiment Label"})
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]+ " posts"))

fig.update_xaxes(ticktext=subreddit_sentiment["subreddit"], tickvals=subreddit_sentiment["subreddit"])

fig.show()

- Among the **negative posts**, the subreddit "movies" seems to be the most negative one.
- Among the **neutral posts**, the subreddit "wallstreetbets" is the most neutral one.
- The most **positive subreddit** seems to be "Damnthatsinteresting".
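Readings like these can be cross-checked without the chart by selecting, for every label, the row with the highest average score. This is a sketch on invented toy values; `toy_sentiment` is a hypothetical stand-in for `subreddit_sentiment`.

```python
import pandas as pd

# Toy stand-in for subreddit_sentiment; the scores are invented.
toy_sentiment = pd.DataFrame({
    "subreddit": ["movies", "movies", "Home", "Home"],
    "sentiment.label": ["negative", "neutral", "negative", "neutral"],
    "sentiment.score": [0.9, 0.5, 0.7, 0.6],
})

# idxmax returns, per label, the row index of the highest average score.
top_idx = toy_sentiment.groupby("sentiment.label")["sentiment.score"].idxmax()
top_rows = toy_sentiment.loc[top_idx]

print(top_rows)
```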

## Are there users that are more positive/negative than others?

In [167]:
# Group the comments by author and label, aggregate the sentiment.score by its mean, and count the number of comments (stored in the _id column)
author_sentiment = comments_df.groupby(["author", "sentiment.label"]).agg({"sentiment.score": "mean", "_id": "count"}).reset_index()

author_sentiment

| | author | sentiment.label | sentiment.score | _id |
|---|---|---|---|---|
| 0 | `----_1_----` | positive | 0.5574 | 1 |
| 1 | `---ShineyHiney---` | negative | 0.9177 | 1 |
| 2 | `-Alexandros` | negative | 0.8538 | 1 |
| 3 | `-Alexandros` | neutral | 0.5423 | 1 |
| 4 | `-Alter-Reality-` | neutral | 0.6701 | 2 |
| ... | ... | ... | ... | ... |
| 8836 | `zorbathegrate` | negative | 0.9578 | 1 |
| 8837 | `zsxking` | neutral | 0.4158 | 1 |
| 8838 | `zwaaa` | negative | 0.9247 | 1 |
| 8839 | `zweivierdrei` | neutral | 0.4492 | 2 |
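Most authors in this table have only one or two comments, which makes per-author averages hard to compare. One possible way to approach the question, sketched here on invented toy data, is to look at the share of each label per author and to restrict the comparison to authors with a minimum number of comments. `toy_comments` and the cutoff `MIN_COMMENTS` are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for comments_df; authors and labels are invented.
toy_comments = pd.DataFrame({
    "author": ["ann", "ann", "ann", "bob", "bob", "bob", "cy"],
    "sentiment.label": ["negative", "negative", "neutral",
                        "positive", "positive", "negative", "negative"],
})

MIN_COMMENTS = 3  # hypothetical cutoff

# Share of each label among an author's comments (each row sums to 1).
label_share = pd.crosstab(
    toy_comments["author"], toy_comments["sentiment.label"], normalize="index"
)

# Keep only authors with enough comments for the share to be meaningful.
n_comments = toy_comments["author"].value_counts()
active = n_comments[n_comments >= MIN_COMMENTS].index
label_share_active = label_share.loc[label_share.index.isin(active)]

most_negative = label_share_active["negative"].idxmax()
print(label_share_active)
print("Largest share of negative comments:", most_negative)
```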


## Is there a correlation between comment upvotes and comment sentiment?

In [168]:
# Scatterplot of the sentiment.score vs upvotes, colored in by sentiment.label and facetted by the sentiment.label
fig = px.scatter(comments_df, x="sentiment.score", y="upvotes", color="sentiment.label", facet_col="sentiment.label",
                 title="Sentiment score of comments vs upvotes of comments",
                 labels={"sentiment.score": "Score", "upvotes": "Upvotes"},
                 opacity=0.2)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]+ " comments"))
fig.show()

The scatter plots again do not show a clear structure or correlation. Let's see if a correlation becomes visible when we only look at a narrower range of upvotes.

In [169]:
cap = 1000

fig = px.scatter(comments_df[comments_df["upvotes"] < cap], x="sentiment.score", y="upvotes", color="sentiment.label", facet_col="sentiment.label",
                 title="Sentiment score of comments vs upvotes of comments",
                 labels={"sentiment.score": "Score", "upvotes": "Upvotes"},
                 opacity=0.2)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]+ " comments"))
fig.show()

There is still no clear structure for comments with fewer than 1000 upvotes. Let's see if there is more structure among the absolute top comments.

In [170]:
min_upvotes = 1000  # named min_upvotes to avoid shadowing the built-in min()

fig = px.scatter(comments_df[comments_df["upvotes"] > min_upvotes], x="sentiment.score", y="upvotes", color="sentiment.label", facet_col="sentiment.label",
                 title="Sentiment score of comments vs upvotes of comments",
                 labels={"sentiment.score": "Score", "upvotes": "Upvotes"},
                 opacity=0.2)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]+ " comments"))
fig.show()

The top comments show the same result: there is no clear structure.
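The visual impression could be complemented by a number. Because upvote counts are heavily skewed, a rank correlation is a reasonable choice; the sketch below computes it as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient and needs only pandas. The data is an invented toy example and `rank_corr` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for comments_df; scores and upvotes are invented.
toy_comments = pd.DataFrame({
    "sentiment.label": ["negative"] * 4 + ["positive"] * 4,
    "sentiment.score": [0.5, 0.6, 0.7, 0.8, 0.5, 0.6, 0.7, 0.8],
    "upvotes": [1, 10, 100, 5000, 3, 2, 1, 0],
})

def rank_corr(group):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return group["sentiment.score"].rank().corr(group["upvotes"].rank())

rank_corr_by_label = {
    label: rank_corr(group)
    for label, group in toy_comments.groupby("sentiment.label")
}

print(rank_corr_by_label)
```

On the real data, values close to zero for every label would support the conclusion drawn from the scatter plots.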