Fall 2025 - EPFL Applied Data Analysis Project
By Eugène Bergeron, Mattia Bianco, Andrea Bissoli, Florian Dejean and Fabio Marchetti
You can visit our Data Story website here!
You can also find the data's repository on GitHub.
Our project focuses on the correlation between the sentiments and emotions expressed in Reddit posts and real-life events.
We use the Reddit Hyperlink Network dataset and the related Subreddit Embeddings dataset, and integrate them with the Stock Market Dataset to introduce additional data points from the real world.
Our motivation stems from the hypothesis that large online platforms like Reddit reflect the anxieties and shifts of the physical world in near-real time. We investigate whether, and to what extent, seasons, stock market trends and high-impact real-world events have measurable effects on how Reddit users interact with each other. Our data story therefore narrates whether and how this collective mood shifts, grounding each step in the data.
Communities in Reddit: each node represents a community. Red nodes initiate more conflicts, while blue nodes do not. Communities are embedded using user-community information. Figure taken from the original paper.
Main question: Can we see an impact of real-world events on Reddit posts?
Sub-questions, driving our narrative:
- How do emotions vary over the seasons and months of the year?
- Is the stock market trend related to Reddit sentiment? Can we predict market behavior by observing the sentiment of related subreddits?
- Can we leverage machine learning to effectively predict different aspects of Reddit sentiment based on real-world data?
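
One way to approach the stock market sub-question is a lagged correlation between a daily sentiment series and daily stock returns. The snippet below is a minimal sketch on synthetic data, not the analysis used in the data story: the function name `lagged_correlation`, the `max_lag` parameter and the toy series are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def lagged_correlation(sentiment: pd.Series, returns: pd.Series, max_lag: int = 5) -> pd.Series:
    """Pearson correlation between sentiment at row t and returns at row t + lag.

    A positive lag means sentiment leads the market. Lags are counted in rows
    of the aligned series, so with real market data they are trading days,
    not calendar days.
    """
    aligned = pd.concat({"sentiment": sentiment, "returns": returns}, axis=1).dropna()
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        # shift(-lag) places the return observed at t + lag on row t
        shifted = aligned["returns"].shift(-lag)
        result[lag] = aligned["sentiment"].corr(shifted)
    return pd.Series(result, name="pearson_r")


# Synthetic example (hypothetical data): returns follow sentiment two days later.
dates = pd.date_range("2016-01-01", periods=300, freq="D")
rng = np.random.default_rng(0)
sentiment = pd.Series(rng.normal(size=300), index=dates)
returns = sentiment.shift(2) + 0.1 * rng.normal(size=300)

corr = lagged_correlation(sentiment, returns, max_lag=5)
best_lag = corr.idxmax()
```

On real data, a peak at a positive lag would only suggest that sentiment tends to move before the market; it would not by itself establish predictive power.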
- Subreddit Embeddings Dataset: https://snap.stanford.edu/data/web-RedditEmbeddings.html. We combine the embeddings dataset with the Reddit Hyperlink Network dataset to enrich our analysis with information on subreddit relationships and topics. We carried out an initial exploratory analysis on the embeddings to identify clusters of related subreddits. Although not all subreddits in the Hyperlink Network dataset are present in the Embeddings dataset, the merged data is still useful for filtering and categorizing posts based on subreddit topics.
- Stock Market Dataset: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset. We use it to map the Reddit Hyperlink Network to real-world data, which lets us measure how closely Reddit's behavior tracks the stock of a given company, such as Apple.
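
The merge between the hyperlink data and the embeddings can be sketched as follows. This is a minimal example on toy tables: the column names (`SOURCE_SUBREDDIT`, `LINK_SENTIMENT`, `subreddit`, `dim_0`, `dim_1`) and the values are assumptions for illustration, and cosine-similarity neighbours stand in here for the clustering step, which is not detailed above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two datasets; names and values are hypothetical.
links = pd.DataFrame({
    "SOURCE_SUBREDDIT": ["apple", "investing", "tinyfoo", "apple"],
    "TARGET_SUBREDDIT": ["android", "stocks", "apple", "stocks"],
    "LINK_SENTIMENT": [-1, 1, 1, 1],
})
embeddings = pd.DataFrame({
    "subreddit": ["apple", "android", "investing", "stocks"],
    "dim_0": [0.9, 0.8, 0.1, 0.2],
    "dim_1": [0.1, 0.3, 0.9, 0.8],
})

# A left merge keeps every hyperlink; the indicator column shows which
# source subreddits actually have an embedding.
merged = links.merge(
    embeddings,
    how="left",
    left_on="SOURCE_SUBREDDIT",
    right_on="subreddit",
    indicator=True,
)
coverage = (merged["_merge"] == "both").mean()
covered = merged[merged["_merge"] == "both"].drop(columns=["_merge", "subreddit"])

# Cosine similarity between subreddit embeddings, to find related communities.
vectors = embeddings.set_index("subreddit")
unit = vectors.div(np.linalg.norm(vectors.to_numpy(), axis=1), axis=0)
similarity = unit @ unit.T
closest_to_apple = similarity.loc["apple"].drop("apple").idxmax()
```

Reporting `coverage` alongside any result makes explicit how many hyperlinks are lost when the analysis is restricted to subreddits that have an embedding.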
This website was made with the help of LLMs.
