### **Homework #4 — Exploratory Data Analysis (EDA)**

**Project:** Analysis of trends in conspiracy theories during 2020 and the influence of the COVID-19 pandemic

**Authors:** Hovoryshcheva Veronika, Morozova Polina

**Team ID:** 15

**Dataset:** normalized_data.csv

**Time spent:** blabla

In [1]:
# start_time = "13:53, 31.10.2025"

#import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns

### **Goal**

The goal of this exploratory data analysis is to understand Reddit discussions about conspiracy theories in 2020, focusing on dataset structure, user activity, and subreddit interactions. It also aims to examine sentiment and emotional tone, identify recurring narratives and key phrases, and explore topic-specific patterns, such as COVID-related discussions and skepticism toward mainstream information. The analysis seeks to reveal dominant narratives, engagement patterns, and language trends across communities.

### **List of the Research Questions**
#### Overview

1. How many total records does the dataset contain, and how are they divided between submissions and comments?
2. What is the timeframe covered by the dataset?
3. What basic information can be summarized about the dataset (columns, data types, missing values)?
4. Which specific conspiracy theories were most frequently discussed in 2020?

#### Activity and Distribution

5. Which are the top 10 most active subreddits by number of posts/comments?
6. Are there significant peaks of activity around major real-world events?
7. How many unique authors are there, and how many contributions did each make?
8. Which subreddits have the highest average score versus the highest post volume?
9. Do different subreddits experience synchronized activity peaks?
10. Do the most active authors post in many subreddits or focus on one community?
11. How does the average score (karma/upvotes) change over time?

#### Content Analysis

12. What is the overall sentiment distribution of all comments and submissions?
13. What are the most positive and most negative subreddits overall?
14. Are longer comments more emotionally expressive (stronger positive/negative values)?
15. How does sentiment vary between submissions and comments?
16. Do posts with more positive sentiment tend to get higher scores?

#### Interesting Findings

17. Can recurring narratives or metaphors be identified?
18. Which grammatical constructions are most common (imperative, interrogative, emotional)?
19. Has skepticism toward official statistics and mainstream media increased during the pandemic?
20. Does the language of users who discuss COVID differ from that of those discussing other conspiracy topics?
21. Do topics with religious undertones have longer discussion threads?
22. Which keywords most strongly co-occur with “COVID” or “virus”?

### **Overview**

#### **Q1** How many total records does the dataset contain, and how are they divided between submissions and comments?
Understanding the overall size and internal balance of the dataset helps evaluate the representativeness of the material and determine whether conspiracy theories spread mainly through original posts or through ongoing discussions in the comments.

In [8]:
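import pandas as pd

# Load the normalized dataset; low_memory=False parses the file in one pass
# instead of chunks, which avoids mixed-dtype inference within a column.
df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\normalized_data.csv", low_memory=False)
#df.info()
#df.head()
# (rows, columns) plus the split between submissions (True) and comments (False)
df.shape, df['is_submission'].value_counts()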

((4285280, 8),
 is_submission
 False    4079056
 True      206224
 Name: count, dtype: int64)
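
The split reported above can also be expressed as shares. The sketch below only re-uses the two counts from the cell output; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts() output above.
n_comments = 4_079_056
n_submissions = 206_224
n_total = n_comments + n_submissions  # should match the row count in df.shape

# Share of each record type, in percent.
submission_share = 100 * n_submissions / n_total
comment_share = 100 * n_comments / n_total

# Average number of comments per submission.
comments_per_submission = n_comments / n_submissions

print(f"submissions: {submission_share:.2f}%")
print(f"comments:    {comment_share:.2f}%")
print(f"comments per submission: {comments_per_submission:.2f}")
```

On these counts, submissions make up about 4.8% of the records, with roughly 20 comments per submission, which suggests that most of the material sits in discussion threads rather than in original posts.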

#### **Q2** What is the timeframe covered by the dataset?
Clarifying the temporal boundaries allows the analysis to be contextualized within specific stages of the COVID-19 pandemic and to trace how discussions evolved alongside major social or political events in 2020.
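
A minimal sketch of how the timeframe could be read off, assuming the dataset has a `created_utc` column holding Unix epoch seconds. That column name and format are assumptions, since only `is_submission` is confirmed by the output above. The toy frame stands in for `df`.

```python
import pandas as pd

def dataset_timeframe(frame: pd.DataFrame, time_col: str = "created_utc"):
    """Return (earliest, latest) timestamps.

    Assumes `time_col` stores Unix epoch seconds (hypothetical column).
    """
    timestamps = pd.to_datetime(frame[time_col], unit="s", utc=True)
    return timestamps.min(), timestamps.max()

# Toy stand-in for the real dataset.
toy = pd.DataFrame({"created_utc": [1577836800, 1593561600, 1609459199]})
first, last = dataset_timeframe(toy)
print(first, last)
```

If the real column is already a datetime string, `pd.to_datetime(frame[time_col])` without `unit="s"` would be the equivalent call.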

#### **Q3** What basic information can be summarized about the dataset (columns, data types, missing values)?
A preliminary structural overview is necessary to assess data completeness and technical readiness for analysis, ensuring that variables like dates, authors, and subreddits are consistent and usable.
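
One possible way to collect this overview in a single table is sketched below. The helper name `structure_summary` and the toy columns are illustrative, not taken from the dataset.

```python
import pandas as pd

def structure_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, number of missing values, and missing share in %."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })

# Toy stand-in for the real dataset.
toy = pd.DataFrame({
    "author": ["a", None, "c", "d"],
    "score": [1, 5, 3, 2],
    "is_submission": [True, False, False, False],
})
summary = structure_summary(toy)
print(summary)
```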

#### **Q4** Which specific conspiracy theories were most frequently discussed in 2020?
Identifying the dominant themes reveals which narratives gained prominence during the pandemic and highlights shifts in the collective focus of conspiracy communities.

### **Activity and Distribution**

#### **Q5** Which are the top 10 most active subreddits by number of posts/comments?
Mapping activity levels across subreddits makes it possible to pinpoint where conspiracy discussions were most intense and which communities played a central role in shaping discourse.
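
A short sketch of the ranking, assuming a `subreddit` column (the name is an assumption). It counts submissions and comments together.

```python
import pandas as pd

def top_subreddits(frame: pd.DataFrame, n: int = 10, col: str = "subreddit") -> pd.Series:
    """Return the n subreddits with the most records, largest first."""
    return frame[col].value_counts().head(n)

# Toy stand-in for the real dataset.
toy = pd.DataFrame({
    "subreddit": ["conspiracy", "ufos", "conspiracy", "covid19", "conspiracy", "ufos"]
})
top = top_subreddits(toy, n=2)
print(top)
```

To rank submissions and comments separately, the same function could be applied to `frame[frame["is_submission"]]` and `frame[~frame["is_submission"]]`.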

#### **Q6** Are there significant peaks of activity around major real-world events?
Correlating posting spikes with external events sheds light on how online conspiracy discussions respond to triggers such as lockdown announcements, vaccine news, or political developments.
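
A rough sketch of spike detection, assuming a `created_utc` column in Unix epoch seconds (hypothetical). A day is flagged when its record count exceeds the mean by `k` standard deviations. This threshold rule is just one simple choice and not a method prescribed by the project.

```python
import pandas as pd

def daily_activity_peaks(frame: pd.DataFrame, time_col: str = "created_utc",
                         k: float = 2.0) -> pd.Series:
    """Return the days whose record count exceeds mean + k * std of daily counts."""
    ts = pd.to_datetime(frame[time_col], unit="s", utc=True)
    events = pd.Series(1, index=pd.DatetimeIndex(ts)).sort_index()
    daily = events.resample("D").sum()  # days without records count as 0
    threshold = daily.mean() + k * daily.std()
    return daily[daily > threshold]

# Toy data: 10 days starting 2020-03-01 with 2 records each,
# plus 18 extra records on the sixth day.
base, day = 1583020800, 86400
stamps = [base + d * day + i for d in range(10) for i in range(2)]
stamps += [base + 5 * day + 100 + i for i in range(18)]
toy = pd.DataFrame({"created_utc": stamps})

peaks = daily_activity_peaks(toy)
print(peaks)
```

The flagged dates would then need to be compared by hand against a timeline of real-world events, since the data alone cannot say what caused a spike.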

#### **Q7** How many unique authors are there, and how many contributions did each make? 
Examining author participation helps determine whether discourse was driven by a few prolific individuals or by a larger, more distributed group of users.
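
A sketch of the author counts, assuming an `author` column (hypothetical name). Reddit exports commonly show removed accounts as `[deleted]`; whether this dataset contains that placeholder is an assumption, so the exclusion list is a parameter.

```python
import pandas as pd

def author_contributions(frame: pd.DataFrame, col: str = "author",
                         exclude=("[deleted]",)):
    """Return (number of unique authors, contributions per author sorted descending)."""
    authors = frame.loc[~frame[col].isin(exclude), col].dropna()
    counts = authors.value_counts()
    return counts.size, counts

# Toy stand-in for the real dataset.
toy = pd.DataFrame({
    "author": ["alice", "bob", "alice", "[deleted]", None, "carol", "alice"]
})
n_unique, per_author = author_contributions(toy)
print(n_unique)
print(per_author)
```

`per_author.describe()` or a histogram on a log scale would show whether a few prolific accounts dominate.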

#### **Q8** Which subreddits have the highest average score versus the highest post volume?
This comparison separates volume from approval: some communities may generate large amounts of content, while others earn higher scores per post.
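
A possible way to put both measures side by side, assuming `subreddit` and `score` columns (both names are assumptions):

```python
import pandas as pd

def subreddit_score_vs_volume(frame: pd.DataFrame) -> pd.DataFrame:
    """Per subreddit: number of records (volume) and mean score, sorted by volume."""
    return (
        frame.groupby("subreddit")["score"]
        .agg(volume="count", mean_score="mean")
        .sort_values("volume", ascending=False)
    )

# Toy stand-in: one busy low-scoring community, one small high-scoring one.
toy = pd.DataFrame({
    "subreddit": ["big", "big", "big", "big", "small", "small"],
    "score": [1, 2, 3, 2, 50, 30],
})
stats = subreddit_score_vs_volume(toy)
print(stats)
```

On real data, subreddits with very few records may need to be filtered out first, because a mean over a handful of posts is unstable.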

#### **Q9** Do different subreddits experience synchronized activity peaks?
Studying temporal synchronization between communities can indicate information diffusion and interconnection among different conspiracy networks.
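
One simple proxy for synchronization is the correlation between daily activity series. The sketch below assumes `subreddit` and `created_utc` (Unix epoch seconds) columns, both hypothetical, and uses Pearson correlation of daily counts as an illustrative choice.

```python
import pandas as pd

def daily_counts_by_subreddit(frame: pd.DataFrame, time_col: str = "created_utc",
                              sub_col: str = "subreddit") -> pd.DataFrame:
    """Rows: days, columns: subreddits, values: number of records."""
    day = pd.to_datetime(frame[time_col], unit="s", utc=True).dt.floor("D")
    return frame.groupby([day, frame[sub_col]]).size().unstack(fill_value=0)

# Toy data: A and B rise and fall together, C moves in the opposite direction.
base = 1583020800  # 2020-03-01 00:00 UTC
per_day = {"A": [1, 2, 5, 2, 1], "B": [2, 4, 10, 4, 2], "C": [5, 2, 1, 2, 5]}
rows = [
    {"subreddit": name, "created_utc": base + d * 86400 + i}
    for name, counts in per_day.items()
    for d, n in enumerate(counts)
    for i in range(n)
]
toy = pd.DataFrame(rows)

daily = daily_counts_by_subreddit(toy)
corr = daily.corr()  # pairwise Pearson correlation between subreddit columns
print(corr.round(2))
```

High correlation only shows that activity moves together. It does not by itself show that content spreads from one community to another.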

#### **Q10** Do the most active authors post in many subreddits or focus on one community?
Analyzing user posting patterns helps identify whether certain participants act as cross-community links or remain confined to specific ideological spaces.
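
A sketch of how breadth could be measured for the most active authors, assuming `author` and `subreddit` columns (hypothetical names):

```python
import pandas as pd

def author_breadth(frame: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """For the top_n most active authors: contribution count and distinct subreddits."""
    stats = frame.groupby("author").agg(
        contributions=("subreddit", "size"),
        n_subreddits=("subreddit", "nunique"),
    )
    return stats.sort_values("contributions", ascending=False).head(top_n)

# Toy stand-in: alice spreads across communities, bob stays in one.
toy = pd.DataFrame({
    "author": ["alice", "alice", "alice", "alice", "bob", "bob", "bob", "carol"],
    "subreddit": ["conspiracy", "conspiracy", "covid19", "ufos",
                  "conspiracy", "conspiracy", "conspiracy", "ufos"],
})
breadth = author_breadth(toy, top_n=2)
print(breadth)
```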

#### **Q11** How does the average score (karma/upvotes) change over time?
Tracking how post scores evolve provides insight into shifting community attitudes and levels of endorsement toward conspiracy-related content throughout the pandemic year.
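
A minimal sketch of a monthly trend, assuming `score` and `created_utc` (Unix epoch seconds) columns. Both names, and the monthly granularity, are assumptions.

```python
import pandas as pd

def monthly_mean_score(frame: pd.DataFrame, time_col: str = "created_utc",
                       score_col: str = "score") -> pd.Series:
    """Mean score per calendar month, indexed by 'YYYY-MM' labels."""
    ts = pd.to_datetime(frame[time_col], unit="s", utc=True)
    month = ts.dt.strftime("%Y-%m")
    return frame[score_col].groupby(month).mean()

# Toy stand-in: two records in January 2020, one in July 2020.
toy = pd.DataFrame({
    "created_utc": [1577836800, 1577923200, 1593561600],
    "score": [10, 20, 4],
})
monthly = monthly_mean_score(toy)
print(monthly)
```

Because Reddit scores are usually heavily skewed, replacing `.mean()` with `.median()` may give a more robust picture.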

### **Content Analysis**

#### **Q12** What is the overall sentiment distribution of all comments and submissions?
Sentiment analysis reveals the emotional atmosphere of conspiracy discussions, indicating whether fear, anger, or hope dominated online exchanges in 2020.

#### **Q13** What are the most positive and most negative subreddits overall?
Comparing sentiment across communities makes it possible to identify where discourse tended to be more constructive, aggressive, or despairing, illustrating emotional diversity within the conspiracy sphere.

#### **Q14** Are longer comments more emotionally expressive (stronger positive/negative values)? 
Exploring the relationship between comment length and emotional intensity can show whether detailed engagement corresponds to stronger affective expression.
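
A sketch of one way to test this, assuming a text column `body` and a precomputed polarity column `sentiment` in the range -1 to 1. Both columns are hypothetical, and the sentiment model itself is out of scope here. Emotional intensity is taken as the absolute value of the polarity.

```python
import pandas as pd

def length_vs_intensity(frame: pd.DataFrame, text_col: str = "body",
                        sentiment_col: str = "sentiment") -> float:
    """Spearman rank correlation between text length and |sentiment|."""
    length = frame[text_col].fillna("").str.len()
    intensity = frame[sentiment_col].abs()
    # Pearson correlation of the ranks is Spearman's rho.
    return length.rank().corr(intensity.rank())

# Toy stand-in: longer texts carry stronger polarity in either direction.
toy = pd.DataFrame({
    "body": ["ok", "not sure", "this is terrible news",
             "absolutely the best thread I have read"],
    "sentiment": [0.1, -0.2, -0.6, 0.9],
})
rho = length_vs_intensity(toy)
print(rho)
```

A rank correlation is used because comment lengths are typically very skewed, so a few extremely long comments would otherwise dominate a linear correlation.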

#### **Q15** How does sentiment vary between submissions and comments? 
Contrasting the tone of original posts with that of replies helps determine whether discussions amplify, neutralize, or challenge the initial emotional framing.
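
A sketch of the comparison using the `is_submission` flag from the dataset and a precomputed `sentiment` column. The latter is hypothetical and would have to be produced by whichever sentiment model the project settles on.

```python
import pandas as pd

def sentiment_by_type(frame: pd.DataFrame, sentiment_col: str = "sentiment") -> pd.DataFrame:
    """Count, mean, and median sentiment for submissions and for comments."""
    labels = frame["is_submission"].map({True: "submission", False: "comment"})
    return frame.groupby(labels)[sentiment_col].agg(["count", "mean", "median"])

# Toy stand-in for the real dataset.
toy = pd.DataFrame({
    "is_submission": [True, True, False, False, False],
    "sentiment": [0.2, 0.4, -0.5, -0.1, 0.0],
})
by_type = sentiment_by_type(toy)
print(by_type)
```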

#### **Q16** Do posts with more positive sentiment tend to get higher scores? 
Assessing this relationship clarifies which emotional tones receive greater validation from the community, shedding light on collective preferences for positivity or outrage.

### **Interesting Findings**

#### **Q17** Can recurring narratives or metaphors be identified? 
Recognizing repeated metaphors and storylines allows us to understand how conspiracy narratives are constructed symbolically, often relying on themes of awakening, deception, or hidden power.

#### **Q18** Which grammatical constructions are most common (imperative, interrogative, emotional)? 
Analyzing sentence structure helps reveal rhetorical strategies—whether users try to command, question, or emotionally appeal to others—to foster belief or participation.

#### **Q19** Has skepticism toward official statistics and mainstream media increased during the pandemic?
Measuring changes in expressions of distrust provides evidence for whether COVID-19 intensified anti-establishment attitudes within conspiracy communities.

#### **Q20** Does the language of users who discuss COVID differ from that of those discussing other conspiracy topics?
Comparing linguistic patterns highlights how the pandemic introduced new vocabularies—medical, scientific, or apocalyptic—and reshaped discourse styles.

#### **Q21** Do topics with religious undertones have longer discussion threads?
Examining thread length in religiously themed posts helps gauge whether spiritual narratives inspire deeper or more sustained engagement.

#### **Q22** Which keywords most strongly co-occur with “COVID” or “virus”?
Identifying keyword co-occurrences uncovers how different ideas—such as “5G,” “vaccine,” “control,” or “Bill Gates”—clustered around the concept of the virus, revealing the structure of pandemic-related conspiracies.
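
A minimal sketch of document-level co-occurrence counting with the standard library only. The tokenizer, the tiny stopword list, and the toy texts are illustrative assumptions; a real run would use the dataset's text column and a fuller stopword list.

```python
import re
from collections import Counter

TARGETS = {"covid", "virus"}
# Tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "by", "was", "about"}

def cooccurring_terms(texts, targets=TARGETS, stopwords=STOPWORDS) -> Counter:
    """Count, per term, the number of texts in which it appears together with a target word."""
    counts = Counter()
    for text in texts:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        if tokens & targets:  # the text mentions at least one target word
            counts.update(tokens - targets - stopwords)
    return counts

toy_texts = [
    "The virus was spread by 5G towers",
    "COVID vaccine is about control",
    "5G and the vaccine: covid control",
    "Moon landing was staged",
]
counts = cooccurring_terms(toy_texts)
print(counts.most_common(3))
```

Raw counts favor words that are frequent everywhere, so an association measure such as pointwise mutual information could be a better ranking on the full dataset.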