# EDA

## Executive Summary

Our exploratory data analysis (EDA) on Reddit's major political and economic subreddits has provided a nuanced view of online political discourse, offering valuable context for understanding user behavior. The analysis delved into various aspects of these online subreddit communities, shedding light on the dynamics of discussions and the interplay of user interests.

Preliminary analysis showed that `r/Conservative` is the most active subreddit, with the highest number of posts and comments, while `r/AskPolitics` has the highest share of posts and comments from distinct users, indicating a more diverse user base. The `r/Conservative` subreddit also has the highest average post score, suggesting that its content resonates strongly with its user base. In addition, `r/Conservative` shares the largest number of users with other subreddits, indicating a shared user base and potentially similar ideological leanings. On the other hand, the `r/Socialism` and `r/Liberal` subreddits show consistent activity, with fewer fluctuations in the number of comments and submissions, suggesting a more stable user base.

Additionally, weekly posting patterns showed that Thursdays are the most active days on the `r/Libertarian`, `r/Socialism`, and `r/Centrist` subreddits, whereas `r/Conservative` sees its highest activity on Wednesdays. Posting activity generally decreases as the year progresses, with the lowest numbers appearing in December. Saturdays and Sundays consistently have fewer posts than weekdays, in line with typical online engagement trends.

To get a better idea of the similarities between the economics and politics subreddits, we obtained counts for two curated word lists, one covering popular economic events and the other containing recent prominent political figures in the U.S. We found that post titles are a better target than post bodies when searching for these words, because the bodies mostly contain links to online news articles, while the titles contain the key words that we are looking for. We also found that the economic subreddits contain a large number of "shitposts" related to cryptocurrencies and NFTs, which is why we had to update the word list to account for these words. Counts for the political word list, on the other hand, do not follow a similar distribution across the grouped subreddits, as shown in both Table 3 and Table 4.

Finally, we incorporated U.S. GDP data from the FRED API from January 2021 to March 2023 to compare the economic data with the `r/Economics` subreddit posting activity. The analysis revealed a relationship between the state of the economy and the level of engagement and sentiment on the `r/Economics` subreddit. During times of economic uncertainty, there is an increase in the number of submissions and a decrease in the average submission score. This suggests that people are more likely to turn to online communities to discuss and seek information about the economy during difficult times. Additionally, the types of posts that are upvoted during these times suggest that people are looking for content that resonates with their concerns and anxieties.

## Analysis Report

### Overview of Subreddit Posts and Comments

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Line-Charts/posts/dist/index.html" width="900px" height="450px"></iframe>
<figcaption>Figure 1: Line Chart of number of posts (per subreddit)</figcaption>
</figure>

Figure 1 presents a line chart that tracks the posting activity across various political and economic subreddits from January 2021 to March 2023. The subreddit `r/Conservative` shows the most pronounced variability and has the highest peaks, suggesting periods of intense activity which might be attributed to specific political events or discussions that resonated strongly with its user base. It's noteworthy that the peaks for `r/Conservative` sharply spike above all other subreddits, indicating that certain topics or times drove engagement significantly more than usual.

In contrast, other subreddits such as `r/Economics`, `r/Finance`, and `r/ChangeMyView` maintain relatively stable, low-level activity over time, with `r/Economics` occasionally showing slight increases that could correspond to economic events or policy discussions. Subreddits like `r/Liberal`, `r/Socialism`, `r/Libertarian`, and `r/Centrist` exhibit moderate activity with fewer fluctuations, reflecting a consistent level of engagement without the sharp spikes observed in the `r/Conservative` subreddit. The data suggests that while political subreddits can experience bursts of heightened activity, forums centered around economics and finance tend to have steadier, more predictable participation rates.

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Line-Charts/distinct-posts/dist/index.html" width="900px" height="450px"></iframe>
<figcaption>Figure 2: Line Chart of number of posts by distinct number of users (per subreddit)</figcaption>
</figure>

Figure 2 illustrates the variability of unique user participation across different subreddits from January 2021 to March 2023. `r/AskPolitics` stands out with a consistently high percentage of distinct posts, reflecting broad and diverse user engagement. In contrast, the `r/Conservative` subreddit shows lower and fluctuating levels of distinct posts, suggesting that a smaller group of users contributes frequently to discussions. Subreddits like `r/Liberal`, `r/Socialism`, and `r/Libertarian` exhibit moderate variability, potentially influenced by political currents and events. `r/Finance` and `r/Economics` tend towards a regular contributor base with fewer distinct users. These patterns provide a snapshot of the community dynamics within these subreddits, highlighting the differences in user diversity and engagement.
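One plausible way to read the "percentage of distinct posts" is the number of distinct authors divided by the number of posts in each subreddit and month. The sketch below illustrates that reading on toy records; the record layout, the user names, and the function name `distinct_author_share` are assumptions made for illustration, not the project's actual pipeline.

```python
from collections import defaultdict

# Hypothetical toy records: (subreddit, month, author). Not the project's data.
posts = [
    ("AskPolitics", "2021-01", "user_a"),
    ("AskPolitics", "2021-01", "user_b"),
    ("AskPolitics", "2021-01", "user_c"),
    ("Conservative", "2021-01", "user_x"),
    ("Conservative", "2021-01", "user_x"),
    ("Conservative", "2021-01", "user_x"),
    ("Conservative", "2021-01", "user_y"),
]


def distinct_author_share(records):
    """Return {(subreddit, month): distinct authors / total posts}."""
    totals = defaultdict(int)
    authors = defaultdict(set)
    for subreddit, month, author in records:
        key = (subreddit, month)
        totals[key] += 1
        authors[key].add(author)
    return {key: len(authors[key]) / totals[key] for key in totals}


shares = distinct_author_share(posts)
for key, share in sorted(shares.items()):
    print(key, f"{share:.0%}")
```

Under this reading, a share close to 100% means almost every post comes from a different author, while a low share means a few authors account for most of the posts.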

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Line-Charts/comments/dist/index.html" width="900px" height="450px"></iframe>
<figcaption>Figure 3: Line Chart of number of comments (per subreddit)</figcaption>
</figure>

Figure 3 tracks the volume of comments across various subreddits, highlighting how user interaction ebbs and flows over time. The `r/Conservative` subreddit shows the most pronounced fluctuations, with sharp spikes indicative of intense conversational bursts. The `r/Libertarian` subreddit also experiences notable, albeit less extreme, surges in activity, potentially aligning with key events or discussions. Other subreddits, including `r/Liberal`, `r/Socialism`, and `r/Centrist`, present more consistent comment patterns, with occasional upticks that may correspond to specific events. The steady increase in `r/Economics` comments towards the end of the timeline suggests growing engagement. This visualization reflects the dynamic nature of online discourse, with political forums often at the mercy of the news cycle, while more thematic forums like `r/Finance` or `r/Economics` maintain a steadier level of dialogue.

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Line-Charts/distinct-comments/dist/index.html" width="900px" height="450px"></iframe>
<figcaption>Figure 4: Line Chart of number of comments by distinct number of users (per subreddit)</figcaption>
</figure>

Figure 4 captures the comment activity trends by unique users across various subreddits over time, revealing the fluctuating levels of user engagement. The `r/Conservative` subreddit experiences the most substantial variations, with dramatic peaks suggesting periods of intense discussion among distinct users. The `r/Libertarian` subreddit also shows significant activity spikes, though these are more moderate in comparison, possibly reflecting particular events or debates that resonate with unique contributors. Subreddits such as `r/Liberal`, `r/Socialism`, and `r/Centrist` display relatively stable levels of unique user comments, interspersed with slight rises potentially tied to specific topical triggers. In contrast, the `r/Economics` subreddit sees a gradual increase in comments by unique users as the timeline progresses, indicating a growing diversity in participant engagement. Overall, this figure illustrates the ebb and flow of individual user participation across political and thematic subreddits.


### Subreddit Controversiality

The subreddits were ranked by counts of measures associated with controversial posts and comments. As Figures 1 and 3 have already shown, `r/Conservative` has a greater number of posts and comments than the other subreddits, so the counts shown in Table 1 have been normalized. Unsurprisingly, the `r/ChangeMyView` subreddit had the greatest number of distinguished posts and comments (nearly 2 standard deviations above the mean), which may hint at heavy moderation, potentially because touchy subjects are being discussed. The subreddits typically associated with the authoritarian left and right on the political compass (`r/Socialism` and `r/Conservative`, respectively) have fewer distinguished posts and comments than the libertarian left and right subreddits (`r/Liberal` and `r/Libertarian`, respectively), an inversion of what one might expect of subreddit moderation. Additionally, the `r/Conservative` subreddit had the greatest number of gilded posts and comments. At a glance, this may look like users rewarding each other with Reddit Gold, which would promote an echo chamber of ideas in an online political space; however, this doesn't seem to be the case, given that the subreddit also has the highest count of controversial comments (over 2 standard deviations above the mean). Instead, this may suggest that users on this subreddit are simply more willing to monetarily support posts and comments that they find appealing.

<figure>
  <img src='./data/plots/Table1.png' alt='Table 1: Subreddits with the Most Number of Controversial Posts and Comments'>
  <figcaption>Table 1: Subreddits with the Most Number of Controversial Posts and Comments</figcaption>
</figure>
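The normalization and the "standard deviations from the mean" comparisons above can be illustrated with a small sketch. It assumes, hypothetically, that normalization means dividing each flagged count by the subreddit's total number of items and then standardizing the resulting rates as z-scores; the subreddit names and all counts are made up.

```python
from statistics import mean, pstdev

# Hypothetical counts per subreddit (not the values behind Table 1).
controversial = {"sub_a": 50, "sub_b": 10, "sub_c": 60, "sub_d": 4}
total_items = {"sub_a": 1000, "sub_b": 500, "sub_c": 3000, "sub_d": 400}

# Step 1: normalize by volume so that large subreddits do not dominate.
rates = {name: controversial[name] / total_items[name] for name in controversial}

# Step 2: express each rate as a number of standard deviations from the mean.
mu = mean(rates.values())
sigma = pstdev(rates.values())  # population standard deviation
z_scores = {name: (rate - mu) / sigma for name, rate in rates.items()}

for name, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: rate={rates[name]:.3f}, z={z:+.2f}")
```

In this toy example `sub_c` has the largest raw count, but `sub_a` ranks first once the counts are divided by volume, which is the reason for normalizing before ranking.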

Next, we looked at individual users and ranked them by counts of measures that correspond to controversial posts and comments. Based on the distinguished submissions and comments, we can see that among the top 20 controversial users, only one (user: "ultimis") is a moderator. Another finding is that the top 20 users all come from the subreddits `r/Centrist`, `r/Libertarian`, or `r/Conservative`. Much like the findings from Table 1, Table 2 also seems to suggest that these three political subreddits have a large amount of intra-subreddit discourse among their posts and comments.

<figure>
<figcaption>Table 2: Top 20 Users with the Most Number of Controversial Posts and Comments</figcaption>
<iframe src="../data/plots/controversial_user.html" width="1000px" height="700px"></iframe>
</figure>


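```{python}
#| echo: false
#| label: fig-interactive-table
#| fig-cap: Top 20 Users with the Most Number of Controversial Posts and Comments

from IPython.display import IFrame  # needed to embed the saved HTML table

# Embed the interactive version of Table 2 at the full width of the page
width_percentage = "100%"
IFrame(src='../data/plots/controversial_user.html', width=width_percentage, height=600)
```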


### Subreddit Scores and Posts

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Bubble-Chart/dist/index.html" width="900px" height="900px"></iframe>
<figcaption>Figure 5: Bubble Chart showing the 50 highest scored posts per subreddit.</figcaption>
</figure>

Figure 5 is a bubble chart illustrating the average scores of the top 50 posts from various political subreddits, where the score is calculated by subtracting the number of downvotes from upvotes. The size of each subreddit's bubble reflects the average score of its posts. The chart shows that the subreddit represented by the largest bubble has the highest average post score, while the smallest bubble corresponds to the subreddit with the lowest average score.

The subreddit with the largest bubble, which the Executive Summary identifies as `r/Conservative`, has significantly outpaced the others in average score, suggesting that posts in this community resonate strongly with its members and receive more upvotes. Conversely, the subreddit with the smallest bubble, possibly `r/Centrist` or `r/Libertarian`, has the lowest average score, indicating a smaller community, less engagement, or a tendency towards more divisive content that doesn't amass as many upvotes.

Within each of these larger subreddit bubbles are smaller bubbles, each representing an individual post; the title of the post is indicated within these smaller bubbles. Additionally, interacting with these smaller bubbles by clicking on them will redirect the viewer to the respective post's link on Reddit for further exploration.
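As a rough illustration of the quantity behind each large bubble, the sketch below averages the highest-scored posts per subreddit. The toy scores and the helper name `mean_top_scores` are hypothetical, and the example uses `top_n=2` only to keep the data small (the chart uses the top 50).

```python
from collections import defaultdict
from heapq import nlargest
from statistics import mean

# Hypothetical (subreddit, score) pairs; score = upvotes minus downvotes.
posts = [
    ("sub_x", 10),
    ("sub_x", 30),
    ("sub_x", 20),
    ("sub_y", 5),
    ("sub_y", 1),
]


def mean_top_scores(records, top_n=50):
    """Average the top_n highest scores within each subreddit."""
    by_subreddit = defaultdict(list)
    for subreddit, score in records:
        by_subreddit[subreddit].append(score)
    return {
        subreddit: mean(nlargest(top_n, scores))
        for subreddit, scores in by_subreddit.items()
    }


bubble_sizes = mean_top_scores(posts, top_n=2)
print(bubble_sizes)
```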


### Subreddits Monthly and Daily Activity
<figure>
  <iframe src="https://anthonymoubarak.georgetown.domains/Calendar-Heatmaps/Conservative/dist/index.html" width="800px" height="500px"></iframe>
  <figcaption>Figure 6: Calendar Heatmap of r/Conservative Reddit Posts</figcaption>
</figure>

Figure 6 depicts the daily and monthly number of posts on the `r/Conservative` subreddit. There's a clear trend of higher activity on weekdays, with Wednesdays being the peak, and lower activity on weekends, particularly on Sundays. The data indicates a seasonal trend, with post volumes generally higher in March and April and reducing towards the end of the year, in November and December. The color gradient emphasizes these variations, with darker hues representing more posts, allowing for a visual representation of user engagement on Reddit throughout the different months and days of the week.

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Calendar-Heatmaps/Liberterian/dist/index.html" width="800px" height="500px"></iframe>
<figcaption>Figure 7: Calendar Heatmap of r/Libertarian Reddit Posts</figcaption>
</figure>

Figure 7 depicts the daily and monthly number of posts on the `r/Libertarian` subreddit. Thursdays generally see the highest activity, particularly in the earlier months like January and February. Overall, the pattern shows a decrease in posting activity as the year progresses, with the lowest numbers appearing towards the end of the year in December. Saturdays and Sundays consistently show fewer posts than weekdays, in line with typical online engagement trends.

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Calendar-Heatmaps/Centrist/dist/index.html" width="800px" height="500px"></iframe>
<figcaption>Figure 8: Calendar Heatmap of r/Centrist Reddit Posts</figcaption>
</figure>

Figure 8 illustrates the daily and monthly number of posts on the `r/Centrist` subreddit. Thursdays stand out as the most active day for posting, especially in the earlier part of the year, with January seeing the peak in activity. As the months advance, there is a noticeable trend of reduced posting activity, culminating in the lowest post counts in December. The pattern also reflects a common online behavior trend, where Saturdays and Sundays see less posting activity than the busier weekdays.

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Calendar-Heatmaps/Socialism/dist/index.html" width="800px" height="500px"></iframe>
<figcaption>Figure 9: Calendar Heatmap of r/Socialism Reddit Posts</figcaption>
</figure>

Figure 9 presents the daily and monthly number of posts on the `r/Socialism` subreddit. The heatmap shows a consistent pattern where weekdays have higher post counts, with Thursdays typically being the most active. The beginning of the year, especially January, has higher activity, which appears to decrease gradually throughout the year, with December having some of the lowest numbers. The trend also follows the usual weekly cycle of online interactions, with Saturdays and Sundays having the fewest posts, a common characteristic of user engagement in online communities.
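The counts behind calendar heatmaps like Figures 6 through 9 can be sketched as a tally of posts per (month, weekday) cell. The example below is a minimal illustration on hypothetical post dates; the helper names are assumptions, and the actual charts may have been built differently.

```python
from collections import Counter
from datetime import date

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Hypothetical post dates (not taken from the dataset).
post_dates = [
    date(2021, 1, 7),   # Thursday
    date(2021, 1, 14),  # Thursday
    date(2021, 1, 9),   # Saturday
    date(2021, 12, 1),  # Wednesday
]


def heatmap_counts(dates):
    """Count posts per (month number, weekday name) cell."""
    return Counter((d.month, WEEKDAYS[d.weekday()]) for d in dates)


def busiest_weekday(dates):
    """Return the weekday name with the most posts overall."""
    totals = Counter(WEEKDAYS[d.weekday()] for d in dates)
    return totals.most_common(1)[0][0]


cells = heatmap_counts(post_dates)
print(cells)
print("Busiest weekday:", busiest_weekday(post_dates))
```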

### Common Users Across Subreddits

<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/Dependency-wheel/dist/index.html" width="800px" height="600px"></iframe>
<figcaption>Figure 10: Dependency Wheel of Users in different Subreddits</figcaption>
</figure>

The dependency wheel depicted in Figure 10 above provides a graphical representation of the shared user base between various political subreddits. The thickness of the connecting bands is directly proportional to the number of users that participate in both subreddits at each end of the band. A notable observation from the wheel is the significant interconnectivity between certain subreddits, which could suggest a shared ideological proximity or an interest overlap among their user bases. For instance, if there is a thick band between the subreddits labeled `r/Liberal` and `r/Socialism`, this would indicate a large common audience, possibly due to similar political leanings or discussions that appeal to both groups.

Additionally, the visualization reveals subreddits that serve as common grounds for diverse political discourse, such as `r/ChangeMyView` or `r/AskPolitics`, where the number of interlinking bands suggests a wide range of users from different political backgrounds converging for debate or inquiry. This could imply that these platforms are more neutral or open-ended, attracting a varied audience seeking to engage with multiple perspectives. The overall pattern highlights not only the segmentation within the political discourse on Reddit but also the points of intersection where cross-ideological conversations are occurring.
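The band thickness in a dependency wheel corresponds to the number of users active in both subreddits of a pair. A minimal sketch of that computation, assuming each subreddit's authors are available as a set (the user names and memberships below are hypothetical):

```python
from itertools import combinations

# Hypothetical author sets per subreddit.
authors = {
    "Liberal": {"user_a", "user_b", "user_c"},
    "Socialism": {"user_b", "user_c", "user_d"},
    "Conservative": {"user_c", "user_e"},
}


def shared_user_counts(author_sets):
    """Return {(subreddit_1, subreddit_2): number of users active in both}."""
    return {
        (first, second): len(author_sets[first] & author_sets[second])
        for first, second in combinations(sorted(author_sets), 2)
    }


overlaps = shared_user_counts(authors)
for pair, count in overlaps.items():
    print(pair, count)
```

Each unordered pair appears once, keyed in alphabetical order, so the result maps directly onto one band per pair of subreddits.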

### Context Similarity Across Subreddit Posts

We present the Cosine and Jaccard similarity scores for posts' bodies and titles across grouped political and economics subreddits. The political subreddits used for the two tables below are `r/Conservative`, `r/Socialism`, `r/Centrist`, and `r/Libertarian`, and the economic subreddits are `r/Finance` and `r/Economics`. Initially, we obtained the counts of three words, `[recession, inflation, unemployment]`, for the grouped subreddits using regex patterns on the posts' bodies and computed Cosine and Jaccard Similarity scores on the counts. The Cosine Similarity score between the grouped subreddits was 0.9 and the Jaccard Similarity was 0.04, which meant that the two groups have similar distributions of words, even though the counts of individual words were significantly different. This finding led to two hypotheses: first, that the **grouped economic subreddits must contain a large number of "shitposts"**, probably related to cryptocurrencies and NFTs, and second, that **economics subreddits posts' bodies contain mainly links** to online news articles, with the key words present in the post titles. Consequently, we updated the word list to include the words `[crypto, blockchain, nft]` found in these flippant posts and also ran the similarity algorithms on the posts' titles.

In addition to the economics word list, we also made a political word list to obtain grouped subreddit counts for the words `[trump, biden, election, fed, powell]` and ran the similarity algorithms on the posts' bodies and titles. Its results and discussion are presented below.
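To make the two measures concrete, here is a minimal sketch of counting keywords with regex patterns and comparing the resulting count vectors. It assumes the Jaccard score is the weighted (min/max) variant computed on counts; the source does not state which variant was used, and the count vectors and example titles are illustrative values chosen to show how a high Cosine score can coexist with a low Jaccard score.

```python
import math
import re


def keyword_counts(texts, words):
    """Count case-insensitive whole-word matches of each word across texts."""
    return [
        sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE))
            for text in texts
        )
        for word in words
    ]


def cosine_similarity(a, b):
    """Compare the direction of two count vectors, ignoring magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def weighted_jaccard(a, b):
    """Sum of element-wise minima over sum of maxima (assumed variant)."""
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))


# Illustrative count vectors with the same proportions but different sizes.
group_one = [100, 200, 50]
group_two = [4, 8, 2]
print(cosine_similarity(group_one, group_two))
print(weighted_jaccard(group_one, group_two))

# Hypothetical post titles, counted against the economics word list.
titles = ["Inflation hits 40-year high", "Is a recession coming? Inflation says yes"]
title_counts = keyword_counts(titles, ["recession", "inflation", "unemployment"])
print(title_counts)
```

Because Cosine Similarity only compares proportions, two groups that mention the same words in the same ratio score close to 1 even when one group uses them far more often, whereas the weighted Jaccard score also penalizes the gap in raw counts.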

#### Post Body

<p><br></p>
<figure>
<div style="text-align: right">
<iframe src="../data/plots/body_cosine_jaccard.html" width="800px" height="280px" scrolling="no"></iframe>
</div>
<figcaption>Table 3: Cosine and Jaccard Similarity of Post Body for combined Political and Economics subreddits</figcaption>
</figure>

#### Post Title

<p><br></p>
<figure>
<div style="text-align: right">
<iframe src="../data/plots/title_cosine_jaccard.html" width="800px" height="280px" scrolling="no"></iframe>
</div>
<figcaption>Table 4: Cosine and Jaccard Similarity of Post Title for combined Political and Economics subreddits</figcaption>
</figure>

Table 4 supports our hypotheses: the Jaccard Similarity score is highest across the post titles of the grouped subreddits when we filter for the economics word list. This means that the post titles not only have a similar distribution of counts across the grouped subreddits but also have similar raw counts for each individual word. This is because the titles contain the key words that we are looking for, while the bodies contain links to online news articles. On the other hand, political words in either the posts' bodies or titles do not have a similar distribution of counts across the grouped subreddits, as shown in both Table 3 and Table 4.

### Subreddit Activity & Economic Factors
#### GDP vs Subreddit Submissions
<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/GDP_Numbers/dist/index.html" width="800px" height="450px"></iframe>
<figcaption>Figure 11: Subreddit Submissions vs GDP</figcaption>
</figure>

Figure 11 presents a comparative analysis of the United States' Real Gross Domestic Product (GDP) against the number of submissions in the `r/Economics` subreddit over a period extending from January 2021 to March 2023. Evident in the plot, the submission count hovers around 1,000 posts per month, reflecting a steady engagement level within the subreddit community.

A notable deviation occurs in July 2022, marked by an annotation pointing to the release of news that the U.S. economy had contracted for two consecutive quarters, often considered a technical indicator of a recession. Following this announcement, there is a marked spike in subreddit activity in August and September 2022, with submissions surging to well over four times the usual number. This spike suggests heightened collective concern and a rush to discuss and understand the implications of the economic news.
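A comparison like the one in Figure 11 requires aligning quarterly GDP figures with monthly submission counts. The sketch below shows one way to do that and to flag months whose submission count exceeds four times the median. All numbers are placeholders (the GDP values are an arbitrary index, not FRED data), and the helper names `quarter_of`, `align`, and `spike_months` are hypothetical.

```python
from statistics import median

# Placeholder monthly submission counts (not the real r/Economics counts).
monthly_submissions = {
    "2022-03": 980, "2022-04": 1020, "2022-05": 1000, "2022-06": 950,
    "2022-07": 1050, "2022-08": 4500, "2022-09": 4200, "2022-10": 1100,
}

# Placeholder quarterly GDP index values (arbitrary, not FRED data).
quarterly_gdp = {"2022Q1": 100.0, "2022Q2": 99.0, "2022Q3": 101.0, "2022Q4": 102.0}


def quarter_of(month):
    """Map a 'YYYY-MM' string to its 'YYYYQn' quarter label."""
    year, month_number = month.split("-")
    return f"{year}Q{(int(month_number) - 1) // 3 + 1}"


def align(submissions, gdp):
    """Attach the GDP value of the enclosing quarter to each month."""
    return [
        (month, count, gdp[quarter_of(month)])
        for month, count in sorted(submissions.items())
    ]


def spike_months(submissions, factor=4):
    """Return months whose count exceeds factor times the median count."""
    baseline = median(submissions.values())
    return sorted(m for m, c in submissions.items() if c > factor * baseline)


rows = align(monthly_submissions, quarterly_gdp)
spikes = spike_months(monthly_submissions)
print(rows[0])
print(spikes)
```

The median is used as the baseline here because it is barely affected by the spike months themselves, unlike the mean.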

#### GDP vs Subreddit Scores
<figure>
<iframe src="https://anthonymoubarak.georgetown.domains/GDP_Scores/dist/index.html" width="800px" height="450px"></iframe>
<figcaption>Figure 12: Subreddit Scores vs GDP</figcaption>
</figure>

Figure 12 again plots the United States' Real Gross Domestic Product (GDP), this time against the average submission score in the `r/Economics` subreddit from January 2021 to March 2023. In contrast to the surge in submissions, this visualization reveals a stark drop in the average submission score during the same period of economic concern. This significant downturn in average score might also be interpreted as a barometer of public sentiment towards the economy, rather than simply a measure of post quality. In times of economic uncertainty, it's common for sentiment to sour, and this shift is often reflected in the collective mood of discussions and the types of content that get upvoted. As real GDP contracted for two consecutive quarters, the mood likely shifted from cautiously optimistic to more pessimistic and critical, leading to a preference for upvoting posts that resonate with the community's concerns and anxieties. However, because the number of posts also increased significantly in that period, lower activity per post (comments and upvotes) could also have pulled the average score down.