# Introduction

The world of data science and machine learning is constantly evolving, with new techniques, technologies, and news emerging every day. One way to stay up to date with these trends is to engage with online communities such as Reddit. In this notebook, we use the Reddit API to collect data from three of the largest and most active subreddits focused on data science, machine learning, and artificial intelligence: `r/datascience`, `r/machinelearning`, and `r/artificial`. We then apply natural language processing techniques to extract insights from the top posts and their comments over the past few years.  
By doing this, we hope to gain a better understanding of current trends and to shed light on the community's attitudes and opinions towards various topics.

# Setup

In [None]:
%pip install pygwalker praw

In [None]:
# Import libraries
import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import pygwalker as pyg
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from transformers import pipeline
from datasets import Dataset

import praw

To use this notebook, you need to create a Reddit app and obtain the credentials specified below. We recommend storing those credentials in a `.env` file and loading them with the `dotenv` package or, as in this case, using the `kaggle_secrets` package.  
If you want to reuse the generated CSV files, you won't need to set up your Reddit credentials since these files will be loaded automatically in the following cells.
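
Outside Kaggle, the same four values could be read from environment variables instead. The sketch below is a minimal example under that assumption: the helper `load_reddit_credentials` is hypothetical, and it assumes the environment variables carry the same names as the Kaggle secrets used in the next cell.

```python
import os

# Assumption: the environment variables use the same names as the Kaggle secrets.
REQUIRED_KEYS = ("CLIENT_ID", "CLIENT_SECRET", "REDIRECT_URI", "USER_AGENT")


def load_reddit_credentials(env=None):
    """Return the Reddit credentials found in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Fail early with a clear message if any credential is missing or empty
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing Reddit credentials: {', '.join(missing)}")

    # Lower-case the keys so they can be used as keyword arguments
    return {key.lower(): env[key] for key in REQUIRED_KEYS}
```

The returned keys match the keyword arguments passed to `praw.Reddit` further down, so the client could be created with `praw.Reddit(**load_reddit_credentials())`.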

In [None]:
# Import to use Kaggle secrets
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
client_id = user_secrets.get_secret("CLIENT_ID")
client_secret = user_secrets.get_secret("CLIENT_SECRET")
redirect_uri = user_secrets.get_secret("REDIRECT_URI")
user_agent = user_secrets.get_secret("USER_AGENT")

In [None]:
# Setup matplotlib integration
%matplotlib inline

In [None]:
# Create Reddit client instance
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    redirect_uri=redirect_uri,
    user_agent=user_agent,
)

# Data collection

To collect the data we use the PRAW package, which is a Python wrapper for the Reddit API. To learn more about this package check the [documentation](https://praw.readthedocs.io/en/latest/).

In [None]:
def get_top_posts(subreddit_list="MachineLearning", limit=1_000, time_filter="all"):
    """
    Get top posts from a list of subreddits.
    """

    # collect one record per post; converted to a dataframe at the end
    rows = []

    # get top posts from subreddit list
    posts = reddit.subreddit(subreddit_list).top(time_filter=time_filter, limit=limit)

    # loop through posts and store the fields we need
    for post in posts:
        rows.append(
            {
                "post_id": post.id,
                # store plain strings instead of PRAW objects
                "subreddit": str(post.subreddit),
                "author": str(post.author) if post.author is not None else None,
                "created_utc": post.created_utc,
                "post_url": post.url,
                "post_title": post.title,
                "link_flair_text": post.link_flair_text,
                "score": post.score,
                "num_comments": post.num_comments,
                "upvote_ratio": post.upvote_ratio,
            }
        )

    return pd.DataFrame(rows)

In [None]:
def get_comments(post_ids, limit=None):
    """
    Get comments from a list of post ids.
    """

    # collect one record per comment; converted to a dataframe at the end
    rows = []

    for post_id in post_ids:
        # get comments from post, expanding the "load more comments" placeholders
        submission = reddit.submission(id=post_id)
        submission.comments.replace_more(limit=limit)
        comments = submission.comments.list()

        # store the fields we need from each comment
        for comment in comments:
            rows.append(
                {
                    # store a plain string instead of a PRAW object
                    "author": str(comment.author) if comment.author is not None else None,
                    "post_id": post_id,
                    "body": comment.body,
                    "distinguished": comment.distinguished,
                    "is_submitter": comment.is_submitter,
                    "score": comment.score,
                    "created_utc": comment.created_utc,
                }
            )

    return pd.DataFrame(rows)

In [None]:
# Output paths
posts_csv_path = "/kaggle/working/DS_ML_AI_posts.csv"
comments_csv_path = "/kaggle/working/DS_ML_AI_comments.csv"

While downloading the post data takes little time, downloading the comments is very time-consuming, so both datasets are cached as CSV files. If you want to download the data again, just delete the CSV files and rerun the notebook.
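
The load-or-download pattern used in the next two cells can be factored into a small helper. This is only a sketch: `load_or_fetch` and its `fetch_fn` argument are hypothetical names, not part of the notebook.

```python
import os

import pandas as pd


def load_or_fetch(csv_path, fetch_fn):
    """Load `csv_path` if it exists; otherwise call `fetch_fn()` and cache the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)

    # No cached file yet: run the (possibly slow) download and save it for next time
    df = fetch_fn()
    df.to_csv(csv_path, index=False)
    return df
```

For instance, `load_or_fetch(posts_csv_path, lambda: get_top_posts(subreddit_list="MachineLearning+datascience+artificial", limit=3_000))` would behave like the posts cell below.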

In [None]:
# Load posts csv file if it exists, if not get top posts and save to csv
if os.path.exists(posts_csv_path):
    posts_df = pd.read_csv(posts_csv_path)
else:
    posts_df = get_top_posts(
        subreddit_list="MachineLearning+datascience+artificial",
        limit=3_000,
        time_filter="all",
    )
    posts_df.to_csv(posts_csv_path, index=False, header=True)

In [None]:
# Load comments csv file if it exists, if not fetch comments and save to csv
if os.path.exists(comments_csv_path):
    comments_df = pd.read_csv(comments_csv_path)
else:
    comments_df = get_comments(
        post_ids=posts_df["post_id"].values,
        limit=None,
    )
    comments_df.to_csv(comments_csv_path, index=False, header=True)

In [None]:
posts_df.shape

In [None]:
comments_df.shape

In [None]:
posts_df.head()

In [None]:
comments_df.head()

In [None]:
# Number of posts by subreddit
posts_df.groupby("subreddit")["post_id"].count()

In [None]:
# Number of comments by subreddit
posts_df.merge(comments_df, on="post_id").groupby("subreddit")["body"].count()

# Data processing

The data processing step for this project is relatively simple, as we can leverage the rich data source provided by the Reddit API. However, we will still need to perform some preprocessing steps to clean up the data. For example, we will:
- Remove columns that we will not be using in the analysis.
- Clean up comments made by bots, such as those made by u/RemindMeBot, or comments that don't have a body.
- Add the subreddit to the comments dataframe.
- Create a new year column by parsing the created_utc column to a datetime object, which will allow us to analyze trends over time.
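
To make these steps concrete before applying them to the real data, here is a minimal sketch on a toy dataframe. The values are made up for illustration, and `clean_comments` is a hypothetical helper that bundles the bot filtering and the year extraction; the notebook itself performs these steps cell by cell.

```python
import pandas as pd

# Toy comments; the values are made up for illustration only
toy_comments = pd.DataFrame(
    {
        "author": ["alice", "RemindMeBot", "bob", "carol"],
        "body": ["Great paper", "I will be messaging you", None, "RemindMe! 2 days"],
        "created_utc": [1_600_000_000, 1_610_000_000, 1_620_000_000, 1_630_000_000],
    }
)


def clean_comments(df):
    """Drop empty and RemindMeBot-related comments, then add a year column."""
    # Remove comments without a body
    df = df.dropna(subset=["body"])

    # Remove the bot's own comments and the comments that summon it
    is_bot = df["author"].eq("RemindMeBot")
    is_trigger = df["body"].str.contains("RemindMe!", regex=False)
    df = df[~(is_bot | is_trigger)].reset_index(drop=True)

    # created_utc is a Unix timestamp in seconds
    return df.assign(year=pd.to_datetime(df["created_utc"], unit="s").dt.year)


cleaned = clean_comments(toy_comments)
```

Only the first toy comment survives the filters, since the others are empty, written by the bot, or a reminder request.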

In [None]:
# Convert created_utc to datetime on posts dataframe
posts_df["created_utc"] = pd.to_datetime(posts_df["created_utc"], unit="s")
posts_df.head()

In [None]:
# Convert created_utc to datetime on comments dataframe
comments_df["created_utc"] = pd.to_datetime(comments_df["created_utc"], unit="s")
comments_df.head()

In [None]:
# Create year column on posts dataframe
posts_df["year"] = posts_df["created_utc"].dt.year.astype(int)
posts_df.head()

In [None]:
# Create year column on comments dataframe
comments_df["year"] = comments_df["created_utc"].dt.year
comments_df.head()

In [None]:
# Remove columns that are not needed for the analysis from posts dataframe
posts_df = posts_df.drop(columns=["post_url"])

In [None]:
# Remove columns that are not needed for the analysis from the comments dataframe
comments_df = comments_df.drop(columns=["distinguished", "is_submitter"])

In [None]:
# Check rows with nan values in posts dataframe
posts_df.isna().sum()

In [None]:
# Check rows with nan values in comments dataframe
comments_df.isna().sum()

In [None]:
# Remove comments with no body
comments_df.dropna(inplace=True, subset=["body"])

In [None]:
# Convert comment body to string type
comments_df["body"] = comments_df["body"].astype(str)

In [None]:
# Convert post title to string type
posts_df["post_title"] = posts_df["post_title"].astype(str)

In [None]:
# Convert link_flair_text to Categorical type in posts dataframe
posts_df["link_flair_text"] = posts_df["link_flair_text"].astype('category')

In [None]:
# Remove u/RemindMeBot related comments
comments_df.drop(comments_df[comments_df.author == "RemindMeBot"].index, inplace=True)
comments_df.drop(comments_df[comments_df["body"].str.contains("RemindMe!")].index, inplace=True)

In [None]:
# Add subreddit to comments dataframe
comments_df = comments_df.merge(posts_df[['post_id', 'subreddit']], on='post_id', how='left')
comments_df.head()

# EDA

The EDA (Exploratory Data Analysis) part of this notebook is relatively straightforward. The main goal is to gain a general understanding of the data and identify any trends or patterns within it. To make this process easier, we have installed and imported `pygwalker`, a package that lets us create interactive plots and explore the data in a more dynamic way. To use it, simply uncomment the code below.

In [None]:
# Use pygwalker to explore the data and get insights more easily
# gwalker = pyg.walk(posts_df)

In [None]:
# Set sns style for the following plots
sns.set_style("whitegrid")

In the next plot we can see the number of posts per year in the three subreddits and how it has increased over the years. Note, however, that the posts analyzed in this notebook are only the top posts of each subreddit, so the following analysis is not representative of the whole community or of the real activity in the subreddits. We'll also take a look at the number of comments.

In [None]:
# Number of posts by year on each subreddit
plt.figure(figsize=(12, 8))

sns.countplot(
    x="year",
    hue="subreddit",
    data=posts_df,
    palette="Set2",
)

In [None]:
# Number of comments per Year for each subreddit
plt.figure(figsize=(12, 8))

sns.countplot(
    x="year",
    hue="subreddit",
    data=comments_df,
    palette="Set2",
)

In [None]:
# Distribution of post scores by subreddit
plt.figure(figsize=(12, 8))

sns.boxplot(
    x="subreddit",
    y="score",
    data=posts_df,
    palette="Set2",
)

In [None]:
# Number of posts by flair
fig = plt.figure(figsize=(12, 8))

sns.countplot(
    y="link_flair_text",
    data=posts_df,
    palette="Set2",
    order=posts_df["link_flair_text"].value_counts().index,
)

In [None]:
# Check trending terms on post titles using wordcloud
plt.figure(figsize=(12, 8))

posts_title_text = " ".join([title for title in posts_df["post_title"].str.lower()])

wcloud = WordCloud(
    collocation_threshold=2, width=960, height=540, background_color="white"
).generate(posts_title_text)

plt.axis("off")
plt.imshow(wcloud)
plt.show()

By analyzing the number of posts and comments per year for each subreddit from 2013 to the first quarter of 2023, we can see that these communities on Reddit have experienced significant growth over the past years. The `r/datascience` subreddit seems to have become the most active community, with a substantial increase in activity since 2020. In comparison, the level of activity on `r/machinelearning` and `r/artificial` has been more stable over the same period, although both have also grown.  
Keeping in mind that only top posts are analyzed, these findings suggest that the data science community on Reddit is rapidly expanding, with an increasing number of individuals seeking out information and resources related to data science and machine learning. This growth may reflect a broader trend towards the democratization of data science, as more people become interested and involved in the field.
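
The yearly counts behind the plots above can also be tabulated, which makes it easier to compare the subreddits numerically. The sketch below uses a made-up toy dataframe with the same `year` and `subreddit` columns as `posts_df`; the numbers are not results from the real data.

```python
import pandas as pd

# Toy posts; the values are made up for illustration only
toy_posts = pd.DataFrame(
    {
        "year": [2019, 2020, 2020, 2021, 2021, 2021],
        "subreddit": [
            "MachineLearning",
            "datascience",
            "MachineLearning",
            "datascience",
            "datascience",
            "artificial",
        ],
    }
)

# One row per year, one column per subreddit, zero where a subreddit has no posts
posts_per_year = (
    toy_posts.groupby(["year", "subreddit"]).size().unstack(fill_value=0)
)
```

Applied to `posts_df` or `comments_df`, the same expression would give the table that the count plots visualize.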

# Sentiment analysis

To perform the sentiment analysis, we have decided to use the RoBERTa-based model [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) from [Hugging Face](https://huggingface.co/). This model has been pre-trained on a large corpus of Twitter data and fine-tuned for sentiment analysis, making it well-suited for our purposes.  
The labels provided by the model are:
- `0` for negative.
- `1` for neutral.
- `2` for positive.
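
As a small illustration of this mapping, the sketch below turns a list of three class scores into a label name. `ID2LABEL` and `top_label` are hypothetical names; the mapping assumes id `0` is negative, `1` is neutral and `2` is positive, and the exact label strings returned by the pipeline come from the model's configuration and may be capitalized differently.

```python
# Assumption: id 0 is negative, 1 is neutral, 2 is positive
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def top_label(scores):
    """Return the label name whose score is the highest."""
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return ID2LABEL[best_id]


example = top_label([0.1, 0.2, 0.7])
```

In the rest of the notebook this step is not needed, because the pipeline already returns the label name for each input.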

It is important to note that the model has certain limitations and may not provide 100% accurate results. Since the model was trained on tweets, it may not perform as well on Reddit comments due to the differences in tone and language. The model may also struggle with identifying sarcasm or other forms of nuanced language commonly used on Reddit. However, despite these limitations, the analysis can still offer valuable insights into the overall sentiment of comments in the data science subreddits.

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
# device=0 runs the model on the first GPU; inputs longer than 512 tokens are truncated
pipe = pipeline("sentiment-analysis", model=MODEL, tokenizer=MODEL, max_length=512, truncation=True, device=0)

In [None]:
# Test the model
pipe("I love this! 💕")

In [None]:
pipe("I hate this! 🤮")

# Sentiment analysis of comments related to ChatGPT

In this section we'll analyze the sentiment expressed in comments related to ChatGPT, an advanced language model developed by OpenAI. To do this, we'll analyze the distribution of sentiments in comments that contain keywords related to ChatGPT.
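
Before filtering, it helps to see what the keyword pattern actually matches. The sketch below uses Python's `re` module on a few made-up sentences. `BROAD` is the same alternation used in the next cell, while `STRICT` is a hypothetical, narrower pattern shown only for comparison.

```python
import re

# Same alternation as the notebook's filter; "gpt" alone already covers the other two
BROAD = re.compile(r"chat gpt|chatgpt|gpt", re.IGNORECASE)
# Hypothetical stricter pattern: only "ChatGPT", "Chat GPT" or "Chat-GPT"
STRICT = re.compile(r"\bchat[\s-]?gpt\b", re.IGNORECASE)

# Made-up sample comments
samples = [
    "ChatGPT wrote my unit tests",
    "GPT-3 was released in 2020",
    "Chat GPT is down again",
    "I prefer random forests",
]

broad_hits = [bool(BROAD.search(text)) for text in samples]
strict_hits = [bool(STRICT.search(text)) for text in samples]
```

The broad pattern also keeps comments about other GPT models, such as the second sample, so the subset analyzed in this section is better described as GPT-related than strictly ChatGPT-related.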

In [None]:
# Create dataframe with GPT related comments
gpt_comments_df = comments_df[
    comments_df["body"].str.contains(
        "chat gpt|chatgpt|gpt", regex=True, case=False,
    )
]

gpt_comments_df = gpt_comments_df.reset_index(drop=True)
gpt_comments_df.shape

In [None]:
gpt_comments_df.head()

In [None]:
# Create dataset with gpt_comments_df's body column
gpt_comments_dataset = Dataset.from_pandas(gpt_comments_df[["body"]])
gpt_comments_dataset

In [None]:
# Make sentiment analysis of ChatGPT related comments
results = pipe(gpt_comments_dataset["body"])
results[:10]

In [None]:
# Convert sentiment analysis results into a pd.Series
sentiment_series = pd.Series([result["label"] for result in results])
sentiment_series

In [None]:
# Concat gpt_comments_df with sentiments analysis results
gpt_comments_df = pd.concat([gpt_comments_df, sentiment_series], axis=1).rename(columns={0: 'sentiment'})
gpt_comments_df.head()

In [None]:
# Number of comments for each sentiment category
gpt_comments_df["sentiment"].value_counts()

In [None]:
# Sentiments distribution in ChatGPT related comments
plt.figure(figsize=(12, 8))

sns.countplot(
    x="sentiment",
    data=gpt_comments_df,
    palette="Set2",
)

In [None]:
# Sentiments distribution in ChatGPT related comments by subreddit
plt.figure(figsize=(12, 8))

sns.countplot(
    x="subreddit",
    hue="sentiment",
    data=gpt_comments_df,
    palette="Set2",
)

According to the sentiment analysis results, the majority of comments related to ChatGPT are classified as neutral, which suggests that most users discuss this technology without expressing strong feelings. However, a significant number of negative comments also exist, suggesting areas of concern within the community, possibly regarding its potential impact on areas such as the workplace, personal development, and the emergence or loss of jobs.
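
Because the subreddits contribute very different numbers of comments, raw counts can be hard to compare. One option is to look at the share of each sentiment within each subreddit. The sketch below shows this on a made-up toy dataframe with the same `subreddit` and `sentiment` columns as `gpt_comments_df`; the numbers are not results from the real data.

```python
import pandas as pd

# Toy labelled comments; the values are made up for illustration only
toy = pd.DataFrame(
    {
        "subreddit": ["datascience"] * 4 + ["artificial"] * 2,
        "sentiment": [
            "neutral",
            "neutral",
            "negative",
            "positive",
            "neutral",
            "negative",
        ],
    }
)

# normalize="index" makes every row (subreddit) sum to 1
shares = pd.crosstab(toy["subreddit"], toy["sentiment"], normalize="index")
```

Each row of `shares` then reads as the sentiment distribution of one subreddit, independent of how many comments it has.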

# Sentiment overview

In [None]:
# Create dataset with comments_df body column
comments_dataset = Dataset.from_pandas(comments_df[["body"]])
comments_dataset

In [None]:
# Make sentiment analysis
results = pipe(comments_dataset["body"])
results[:10]

In [None]:
# Convert sentiment analysis results into a pd.Series
sentiment_series = pd.Series([result["label"] for result in results])
sentiment_series

In [None]:
# Concat comments_df with sentiments analysis results
comments_df = pd.concat([comments_df, sentiment_series], axis=1).rename(columns={0: 'sentiment'})
comments_df.head()

In [None]:
# Save comments with sentiment analysis results into CSV
comments_with_sentiment_csv_path = "/kaggle/working/DS_ML_AI_comments_with_sentiments.csv"
comments_df.to_csv(comments_with_sentiment_csv_path, index=False, header=True)

In [None]:
# Number of comments for each sentiment category
comments_df["sentiment"].value_counts()

In [None]:
# Sentiments distribution in comments
plt.figure(figsize=(12, 8))

sns.countplot(
    x="sentiment",
    data=comments_df,
    palette="Set2",
)

Following the sentiment analysis of the ChatGPT-related comments, we performed a sentiment analysis on all comments in the dataset. The results reveal that, similar to the previous analysis, the majority of comments are neutral, followed by negative comments and, lastly, positive ones. This finding suggests that users tend to express fairly balanced views on the topics discussed in these comments, albeit with a certain level of dissatisfaction or criticism.