### **Homework #4: Exploratory Data Analysis (EDA)**

**Project:** Analysis of trends in conspiracy theories during 2020 and the influence of the COVID-19 pandemic

**Authors:** Hovoryshcheva Veronika, Morozova Polina

**Team ID:** 15

**Dataset:** normalized_data.csv

**Time spent:** blabla

In [None]:
# start_time = "13:53, 31.10.2025"

#import pandas as pd
#import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns

### **Goal**

The goal of this exploratory data analysis is to understand Reddit discussions about conspiracy theories in 2020, focusing on dataset structure, user activity, and subreddit interactions. It also aims to examine sentiment and emotional tone, identify recurring narratives and key phrases, and explore topic-specific patterns, such as COVID-related discussions and skepticism toward mainstream information. The analysis seeks to reveal dominant narratives, engagement patterns, and language trends across communities.

### **List of the Research Questions**
#### Overview

1. How many total records does the dataset contain, and how are they divided between submissions and comments?
2. What is the timeframe covered by the dataset?
3. What basic information can be summarized about the dataset (columns, data types, missing values)?
4. Which specific conspiracy theories were most frequently discussed in 2020?

#### Activity and Distribution

5. Which are the top 10 most active subreddits by number of posts/comments?
6. Are there significant peaks of activity around major real-world events?
7. How many unique authors are there, and how many contributions did each make?
8. Which subreddits have the highest average score versus the highest post volume?
9. Do different subreddits experience synchronized activity peaks?
10. Do the most active authors post in many subreddits or focus on one community?
11. How does the average score (karma/upvotes) change over time?

#### Sentiment Analysis

12. What is the overall sentiment distribution of all comments and submissions?
13. What are the most positive and most negative subreddits overall?
14. Are longer comments more emotionally expressive (stronger positive/negative values)?
15. How does sentiment vary between submissions and comments?
16. Do posts with more positive sentiment tend to get higher scores?

#### Interesting Findings

17. Can recurring narratives or metaphors be identified?
18. Which grammatical constructions are most common (imperative, interrogative, emotional)?
19. Has skepticism toward official statistics and mainstream media increased during the pandemic?
20. Does the language of users who discuss COVID differ from that of those discussing other conspiracy topics?
21. Do topics with religious undertones have longer discussion threads?
22. Which keywords most strongly co-occur with “COVID” or “virus”?

### **Overview**

#### **Q1** How many total records does the dataset contain, and how are they divided between submissions and comments?
Understanding the overall size and internal balance of the dataset helps evaluate the representativeness of the material and determine whether conspiracy theories spread mainly through original posts or through ongoing discussions in the comments.

In [None]:
import pandas as pd
import plotly.express as px

df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\dataset.csv", low_memory=False)

In [None]:
df["Record Type"] = df["is_submission"].map({True: "Submission", False: "Comment"})

counts = df["Record Type"].value_counts().reset_index()
counts.columns = ["Record Type", "Count"]
counts["Percentage"] = (counts["Count"] / counts["Count"].sum() * 100).round(2)

In [None]:
fig = px.bar(
    counts,
    x="Count",
    y="Record Type",
    orientation="h",
    color="Record Type",
    color_discrete_map={"Submission": "#0c2688", "Comment": "#26A57F"},
    title="Distribution of Submissions vs Comments",
    text=counts.apply(lambda r: f"{r['Count']:,} ({r['Percentage']}%)", axis=1)
)

fig.update_layout(
    xaxis_title="Number of Records",
    yaxis_title="Record Type",
    title_font_size=18,
    plot_bgcolor="white"
)
fig.update_traces(textposition="outside")
fig.show()



#### **Q2** What is the timeframe covered by the dataset?
Clarifying the temporal boundaries allows the analysis to be contextualized within specific stages of the COVID-19 pandemic and to trace how discussions evolved alongside major social or political events in 2020.

In [None]:
df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df.dropna(subset=["created"])


df["month"] = df["created"].dt.month_name().str[:3]
df["year"] = df["created"].dt.year
month_counts = (
    df.groupby(["year", "month"]).size().reset_index(name="Count")
)

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_counts["month"] = pd.Categorical(month_counts["month"], categories=month_order, ordered=True)
month_counts = month_counts.sort_values(["year", "month"])

In [None]:
month_counts["label"] = month_counts["month"].astype(str)

fig = px.bar(
    month_counts,
    x="label",
    y="Count",
    text="Count",
    color_discrete_sequence=["#0c2688"]
)

fig.update_traces(textposition="outside")

fig.update_layout(
    title="Dataset Timeframe: Monthly Distribution of Records",
    xaxis_title="Month",
    yaxis_title="Number of Posts + Comments",
    showlegend=False,
    plot_bgcolor="white",
    title_font_size=18,
    xaxis=dict(
        type="category",
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    )
)

fig.show()

#### **Q3** What basic information can be summarized about the dataset (columns, data types, missing values)?
A preliminary structural overview is necessary to assess data completeness and technical readiness for analysis, ensuring that variables like dates, authors, and subreddits are consistent and usable.
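
A minimal sketch of such an overview, assuming the data are loaded into a pandas DataFrame. The small `sample` frame and the helper name `summarize_structure` are illustrative and not part of the project; the column names mirror those used elsewhere in this notebook.

```python
import pandas as pd

def summarize_structure(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share (%)."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
    })
    summary["missing_pct"] = (summary["missing"] / len(frame) * 100).round(2)
    return summary

# Hypothetical stand-in for the real dataset
sample = pd.DataFrame({
    "author": ["a", "b", None, "a"],
    "subreddit": ["conspiracy", "conspiracy", "news", "news"],
    "score": [5, 1, 3, None],
    "is_submission": [True, False, False, False],
})

structure = summarize_structure(sample)
print(f"{len(sample)} rows, {sample.shape[1]} columns")
print(structure)
```

Applied to the real data, the same helper would be called as `summarize_structure(df)`.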

#### **Q4** Which specific conspiracy theories were most frequently discussed in 2020?
Identifying the dominant themes reveals which narratives gained prominence during the pandemic and highlights shifts in the collective focus of conspiracy communities.

In [None]:
import re

groups = {
    "COVID-related": ["virus", "vaccine", "plandemic"],
    "5G Technology": ["5g"],
    "Global Elite": ["elite", "globalists", "bill gates", "epstein", "zuckerberg"],
    "Deep State": ["deep state", "illuminati", "world order", "state"],
    "Mind Control": ["mind control"],
    "Fake News": ["fake news", "truth"]
}

results = []
for theme, words in groups.items():
    total = 0
    for w in words:
        pattern = rf"\b{re.escape(w)}\b"
        total += df["body"].str.contains(pattern, case=False, na=False).sum()
    results.append({"Theme": theme, "Count": total})

themes_df = pd.DataFrame(results).sort_values("Count", ascending=False)



In [None]:
colors = ["#0c2688", "#8d84bd", "#d3c1c1", "#ec9c9d", "#d43d51", "#26A57F"]

fig = px.bar(
    themes_df,
    x="Count",
    y="Theme",
    orientation="h",
    text="Count",
    color_discrete_sequence=[colors[0]],
    title="Most Discussed Conspiracy Theory Groups (2020)"
)

fig.update_traces(textposition="outside")

fig.update_layout(
    xaxis_title="Number of Mentions",
    yaxis_title="Conspiracy Theme Group",
    showlegend=False,
    plot_bgcolor="white",
    title_font_size=18,
    xaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    )
)

fig.show()
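
The loop above adds one count per keyword, so a row that mentions both "virus" and "vaccine" contributes twice to the same theme, and the bare keyword "state" also matches every "deep state" row. A possible alternative, sketched here under the assumption that each row should count at most once per theme, joins a theme's keywords into a single pattern. The `toy` frame and the name `count_rows_per_theme` are hypothetical.

```python
import re
import pandas as pd

def count_rows_per_theme(texts: pd.Series, groups: dict) -> pd.DataFrame:
    """Count rows that mention at least one keyword of each theme."""
    rows = []
    for theme, words in groups.items():
        # One alternation per theme: a row counts once however many keywords it contains
        pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
        hits = texts.str.contains(pattern, case=False, na=False, regex=True)
        rows.append({"Theme": theme, "Rows": int(hits.sum())})
    return pd.DataFrame(rows).sort_values("Rows", ascending=False, ignore_index=True)

toy = pd.DataFrame({"body": [
    "The virus and the vaccine are linked",
    "Deep state again",
    "5G towers everywhere",
    None,
]})
toy_groups = {
    "COVID-related": ["virus", "vaccine", "plandemic"],
    "Deep State": ["deep state", "state"],
    "5G Technology": ["5g"],
}
theme_rows = count_rows_per_theme(toy["body"], toy_groups)
print(theme_rows)
```

With this counting rule the x-axis would read "Number of Records Mentioning the Theme" rather than "Number of Mentions".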

### **Activity and Distribution**

#### **Q5** Which are the top 10 most active subreddits by number of posts/comments?
Mapping activity levels across subreddits makes it possible to pinpoint where conspiracy discussions were most intense and which communities played a central role in shaping discourse.

In [None]:
subreddit_counts = df["subreddit"].value_counts().head(10).reset_index()
subreddit_counts.columns = ["Subreddit", "Count"]

colors = ["#0c2688", "#8d84bd", "#d3c1c1", "#ec9c9d", "#d43d51", "#26A57F"]

fig = px.bar(
    subreddit_counts,
    x="Count",
    y="Subreddit",
    orientation="h",
    text="Count",
    color_discrete_sequence=[colors[0]]
)

fig.update_layout(
    title="Top 10 Most Active Subreddits by Posts and Comments",
    xaxis_title="Number of Posts + Comments",
    yaxis_title="Subreddit",
    showlegend=False,
    plot_bgcolor="white",
    title_font_size=18,
    xaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    ),
    yaxis=dict(
        autorange="reversed",
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    )
)
fig.update_traces(textposition="outside")
fig.show()


#### **Q6** Are there significant peaks of activity around major real-world events?
Correlating posting spikes with external events sheds light on how online conspiracy discussions respond to triggers such as lockdown announcements, vaccine news, or political developments.

In [None]:
df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df.dropna(subset=["created"])

df["month"] = df["created"].dt.to_period("M")

monthly_counts = (
    df.groupby(["month", "is_submission"])
    .size()
    .reset_index(name="Count")
)

monthly_counts["month"] = monthly_counts["month"].astype(str)

pivot = monthly_counts.pivot(index="month", columns="is_submission", values="Count").fillna(0)
pivot.columns = ["Comments", "Submissions"]
pivot = pivot.reset_index()

In [None]:
import plotly.graph_objects as go
from datetime import datetime
import time

colors = ["#0c2688", "#8d84bd", "#d1bfbf", "#ec9c9d", "#d43d51", "#26A57F"]

df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df[(df["created"] >= "2020-01-01") & (df["created"] < "2021-01-01")]

df["Submissions"] = df["is_submission"].astype(int)
df["Comments"] = (~df["is_submission"]).astype(int)

weekly = df.set_index("created").resample("W").agg(
    Submissions=("Submissions", "sum"),
    Comments=("Comments", "sum")
).reset_index()

weekly["Total"] = weekly["Submissions"] + weekly["Comments"]
weekly["Total"] = weekly["Total"].interpolate()


fig = go.Figure()

fig.add_trace(go.Scatter(
    x=weekly["created"],
    y=weekly["Total"],
    mode="lines",
    name="Total Activity",
    line=dict(color=colors[0], width=3)
))

events = {
    "2020-03-11": "WHO declares <br> pandemic",
    "2020-04-01": "First <br> lockdowns",
    "2020-12-15": "Vaccine <br> rollout begins",
    "2020-05-25": "George Floyd’s <br> murder in Minneapolis",
    "2020-07-15": "Massive Twitter hack",
    "2020-07-02": "Ghislaine Maxwell<br>arrested<br>.",
    "2020-11-03": "U.S. Presidential Election"
}

for d, label in events.items():
    ts = time.mktime(datetime.strptime(d, "%Y-%m-%d").timetuple()) * 1000
    if d < "2021-01-01":
        fig.add_vline(
            x=ts,
            line_width=1.5,
            line_dash="dash",
            line_color=colors[4],
            annotation_text=label,
            annotation_position="top",
            annotation_font=dict(size=11, color=colors[4])
        )
fig.add_vrect(
    x0="2020-03-23", x1="2020-03-31",
    fillcolor="gray",
    opacity=0.15,
    layer="below",
    line_width=0,
    annotation_text="Data missing",
    annotation_position="top right",
    annotation_font=dict(size=11, color="gray")
)
fig.add_vrect(
    x0="2020-07-01", x1="2020-08-01",
    fillcolor="blue",
    opacity=0.15,
    layer="below",
    line_width=0
)
fig.add_annotation(
    x="2020-07-15",
    y=0.03, yref="paper",
    text="Epstein–Maxwell case period",
    showarrow=False,
    font=dict(size=11, color="blue"),
    yanchor="bottom"
)

fig.add_vrect(
    x0="2020-10-15", x1="2020-11-15",
    fillcolor="red",
    opacity=0.15,
    layer="below",
    line_width=0
)
fig.add_annotation(
    x="2020-10-30",
    y=0.03, yref="paper",
    text="U.S. Election period",
    showarrow=False,
    font=dict(size=11, color="red"),
    yanchor="bottom"
)




fig.update_layout(
    title="Weekly Activity: Posts and Comments During 2020",
    xaxis_title="Month",
    yaxis_title="Number of Records",
    plot_bgcolor="white",
    title_font_size=18,
    legend=dict(
        title=None,
        orientation="h",
        yanchor="bottom", y=1.02,
        xanchor="right", x=1
    ),
    xaxis=dict(
        type="date",
        dtick="M1",
        tickformat="%b %Y",
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False,
        range=["2020-01-01", "2020-12-31"]
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        gridwidth=1,
        zeroline=False
    )
)

fig.show()
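
To complement the visual inspection, peaks can also be flagged numerically. The sketch below is one simple heuristic, not the method used in this project: it marks weeks whose total lies more than a chosen number of standard deviations above the mean. The threshold of 1.5 and the synthetic `weekly_demo` series are assumptions.

```python
import pandas as pd

def flag_peaks(totals: pd.Series, z_threshold: float = 1.5) -> pd.Series:
    """Return a boolean Series marking values more than z_threshold SDs above the mean."""
    z = (totals - totals.mean()) / totals.std(ddof=0)
    return z > z_threshold

# Synthetic weekly totals with one obvious spike in the fifth week
weeks = pd.date_range("2020-01-05", periods=10, freq="W")
weekly_demo = pd.Series([100, 110, 95, 105, 400, 100, 98, 102, 107, 99], index=weeks)

peaks = flag_peaks(weekly_demo)
print(weekly_demo[peaks])
```

On the real data the call would be `flag_peaks(weekly.set_index("created")["Total"])`; the flagged weeks can then be compared against the event dates listed above.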


#### **Q7** How many unique authors are there, and how many contributions did each make? 
Examining author participation helps determine whether discourse was driven by a few prolific individuals or by a larger, more distributed group of users.

In [None]:
import pandas as pd
import plotly.express as px 


counts = df["author"].value_counts(dropna=False)
author_counts = counts.rename_axis("Author").reset_index(name="Contributions")

bins = author_counts["Contributions"].value_counts().sort_index().reset_index()
bins.columns = ["Contributions", "Authors"]
bins["Contributions"] = pd.to_numeric(bins["Contributions"], errors="coerce")
bins["Authors"] = pd.to_numeric(bins["Authors"], errors="coerce")

ranges = [(11,20), (21,30), (31,40), (41,50), (51,100), (101, 1000)]
group_rows = []
for r in ranges:
    low, high = r
    group_rows.append({
        "Contributions": f"{low}-{high}",
        "Authors": bins.loc[
            (bins["Contributions"] >= low) & (bins["Contributions"] <= high),
            "Authors"
        ].sum()
    })

over_1000 = bins.loc[bins["Contributions"] > 1000, "Authors"].sum()
group_rows.append({"Contributions": ">1000", "Authors": over_1000})

bins_trimmed = bins.loc[bins["Contributions"] <= 10].copy()
bins_final = pd.concat([bins_trimmed, pd.DataFrame(group_rows)], ignore_index=True)
bins_final["Authors"] = pd.to_numeric(bins_final["Authors"], errors="coerce")
bins_final["Contributions"] = bins_final["Contributions"].astype(str)
order = [str(i) for i in range(1, 11)] + [f"{low}-{high}" for (low, high) in ranges] + [">1000"]

fig = px.bar(
    bins_final,
    x="Contributions",
    y="Authors",
    category_orders={"Contributions": order},
    color_discrete_sequence=["#0c2688"],
    text="Authors",
    title="Number of Authors by Number of Contributions"
)

fig.update_traces(
    texttemplate="%{text:,}",
    textposition="outside",
    marker_line_width=0.5,
    marker_line_color="white",
    opacity=0.9
)

fig.update_layout(
    xaxis_title="Number of Contributions (posts + comments)",
    yaxis_title="Number of Authors",
    plot_bgcolor="white",
    title_font_size=18,
    xaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1, zeroline=False),
    yaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1, zeroline=False)
)

fig.show()


In [None]:
groups = [
    ("1", bins.loc[bins["Contributions"] == 1, "Authors"].sum()),
    ("2–5", bins.loc[(bins["Contributions"] >= 2) & (bins["Contributions"] <= 5), "Authors"].sum()),
    ("6–10", bins.loc[(bins["Contributions"] >= 6) & (bins["Contributions"] <= 10), "Authors"].sum()),
    ("11–50", bins.loc[(bins["Contributions"] >= 11) & (bins["Contributions"] <= 50), "Authors"].sum()),
    ("51+", bins.loc[bins["Contributions"] > 50, "Authors"].sum())
]

pie_data = pd.DataFrame(groups, columns=["Contribution Range", "Authors"])

colors = ["#0c2688", "#8d84bd", "#ceaae6", "#ec9c9d", "#d43d51"]

fig = px.pie(
    pie_data,
    names="Contribution Range",
    values="Authors",
    color_discrete_sequence=colors,
    title="Share of Authors by Contribution Range"
)

fig.update_traces(
    textinfo="label+percent",
    insidetextorientation="radial",
    textfont_size=13
)

fig.update_layout(
    title_font_size=18,
    plot_bgcolor="white"
)

fig.show()
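
Whether discourse is driven by a few prolific users can also be summarized in a single number: the share of all contributions made by the most active authors. A minimal sketch, assuming one row per post or comment with an `author` value; `top_share` and the toy log are illustrative.

```python
import pandas as pd

def top_share(authors: pd.Series, top_fraction: float = 0.1) -> float:
    """Share of all contributions made by the most active `top_fraction` of authors."""
    per_author = authors.value_counts()  # sorted in descending order, NaN dropped
    n_top = max(1, int(len(per_author) * top_fraction))
    return per_author.iloc[:n_top].sum() / per_author.sum()

# Hypothetical activity log: one prolific author and nine occasional ones
log = pd.Series(["u0"] * 11 + [f"u{i}" for i in range(1, 10)])
share = top_share(log, top_fraction=0.1)
print(f"Top 10% of authors wrote {share:.0%} of all contributions")
```

Placeholder authors such as deleted accounts would need to be excluded first, since they would otherwise appear as a single very active user.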


#### **Q8** Which subreddits have the highest average score versus the highest post volume?
This comparison exposes differences between popularity and engagement: some communities may generate large amounts of content, while others achieve greater approval or influence per post.

In [None]:
import plotly.express as px
import numpy as np

subreddit_stats = (
    df.groupby("subreddit")
    .agg(avg_score=("score", "mean"), post_volume=("subreddit", "count"))
    .reset_index()
)

volume_cutoff = np.percentile(subreddit_stats["post_volume"], 99)
score_cutoff = np.percentile(subreddit_stats["avg_score"], 99)
filtered = subreddit_stats[
    (subreddit_stats["post_volume"] <= volume_cutoff) &
    (subreddit_stats["avg_score"] <= score_cutoff)
]

vmin, vmax = np.percentile(filtered["avg_score"], [5, 95])

fig = px.scatter(
    filtered,
    x="post_volume",
    y="avg_score",
    size="post_volume",
    color="avg_score",
    hover_name="subreddit",
    color_continuous_scale=[
        "#0c2688", "#3855b2", "#8d84bd", "#ec9c9d", "#d43d51"
    ],
    range_color=[vmin, vmax],
    title="Average Score vs Post Volume by Subreddit (Focused View)"
)

fig.update_xaxes(
    type="log",
    title="Post Volume (log scale)",
    range=[np.log10(filtered["post_volume"].min()), np.log10(volume_cutoff)]
)
fig.update_yaxes(
    title="Average Score",
    range=[filtered["avg_score"].min(), score_cutoff]
)

fig.update_layout(
    plot_bgcolor="white",
    title_font_size=18,
    coloraxis_colorbar=dict(
        title="Avg Score",
        tickvals=np.linspace(vmin, vmax, 6).round(1)
    ),
    xaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1),
    yaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1)
)

fig.show()



#### **Q9** Do different subreddits experience synchronized activity peaks?
Studying temporal synchronization between communities can indicate information diffusion and interconnection among different conspiracy networks.

In [None]:
df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df.dropna(subset=["created"])

top_subs = df["subreddit"].value_counts().head(6).index
df_top = df[df["subreddit"].isin(top_subs)].copy()

df_top["month"] = df_top["created"].dt.strftime("%Y-%m")
activity = df_top.groupby(["month", "subreddit"]).size().reset_index(name="ActivityCount")



In [None]:
colors = ["#0c2688", "#8d84bd", "#26A57F", "#ec9c9d", "#d43d51", "#e3b23c"]
activity["month"] = activity["month"].astype(str)

fig = px.line(
    activity,
    x="month",
    y="ActivityCount",
    color="subreddit",
    color_discrete_sequence=colors,
    title="Activity Trends of Top Subreddits Over Time (Log Scale)"
)

fig.update_traces(mode="lines+markers", line=dict(width=2))

fig.update_layout(
    xaxis_title="Month (2020)",
    yaxis_title="Number of Posts / Comments (log scale)",
    plot_bgcolor="white",
    title_font_size=18,
    legend_title_text="Subreddit",
    yaxis_type="log",
    xaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1),
    yaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1)
)

events = {
    "2020-04-01": "First lockdowns",
    "2020-07-02": "Ghislaine Maxwell arrested",
    "2020-11-03": "US Presidential Election"
}

for date, label in events.items():
    date_dt = pd.to_datetime(date)
    fig.add_shape(
        type="line",
        x0=date_dt,
        x1=date_dt,
        y0=0,
        y1=1,
        xref="x",
        yref="paper",
        line=dict(color="#d43d51", width=1.5, dash="dash")
    )
    fig.add_annotation(
        x=date_dt,
        y=1,
        yref="paper",
        text=label,
        showarrow=False,
        xanchor="left",
        yanchor="bottom",
        font=dict(size=11, color="#d43d51")
    )

fig.show()
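
Synchronization can also be quantified rather than read off the chart. One option, sketched below as an assumption and not as the project's method, is to pivot the monthly counts into one column per subreddit and correlate the month-over-month changes. The subreddit names and counts are made up.

```python
import pandas as pd

# Hypothetical monthly activity in long format, shaped like the `activity` frame above
activity_demo = pd.DataFrame({
    "month": ["2020-01", "2020-02", "2020-03", "2020-04"] * 3,
    "subreddit": ["sub_a"] * 4 + ["sub_b"] * 4 + ["sub_c"] * 4,
    "ActivityCount": [10, 20, 15, 30, 100, 200, 150, 300, 50, 40, 45, 30],
})

wide = activity_demo.pivot(index="month", columns="subreddit", values="ActivityCount")
# Correlate changes, not levels, so shared long-term growth does not dominate
sync = wide.diff().dropna().corr()
print(sync.round(2))
```

Values near 1 indicate that two communities rise and fall together; values near -1 indicate opposite movements. With only twelve monthly points per subreddit, such coefficients are rough indications at best, and weekly counts would give more stable estimates.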

#### **Q10** Do the most active authors post in many subreddits or focus on one community?
Analyzing user posting patterns helps identify whether certain participants act as cross-community links or remain confined to specific ideological spaces.

In [None]:
import numpy as np
activity = (
    df.groupby("author")
    .agg(total_posts=("body", "count"), unique_subs=("subreddit", "nunique"))
    .reset_index()
)
activity = activity[activity["total_posts"] > 0]

activity["total_posts"] = np.clip(activity["total_posts"], 1, 5000)
activity["unique_subs"] = np.clip(activity["unique_subs"], 1, 50)

rng = np.random.default_rng(42)
activity["unique_subs_jitter"] = activity["unique_subs"] + rng.normal(0, 0.05, len(activity))


In [None]:
import networkx as nx
import plotly.graph_objects as go
import numpy as np

top_subs = df["subreddit"].value_counts().head(10).index
top_authors = df["author"].value_counts().head(300).index
filtered = df[df["subreddit"].isin(top_subs) & df["author"].isin(top_authors)]

G = nx.Graph()
for _, row in filtered.iterrows():
    G.add_node(row["author"], type="author")
    G.add_node(row["subreddit"], type="subreddit")
    G.add_edge(row["author"], row["subreddit"])

authors_low, authors_mid, authors_high = [], [], []
for n in G.nodes():
    if G.nodes[n]["type"] == "author":
        degree = len(list(G.neighbors(n)))
        if degree > 5:
            authors_high.append(n)
        elif degree > 3:
            authors_mid.append(n)
        else:
            authors_low.append(n)

subreddits = [n for n, d in G.nodes(data=True) if d["type"] == "subreddit"]

pos = nx.shell_layout(G, nlist=[subreddits, authors_high, authors_mid, authors_low], scale=2)
for n in subreddits:
    pos[n] = pos[n] * 1.5  

x_edges, y_edges = [], []
for edge in G.edges():
    x_edges += [pos[edge[0]][0], pos[edge[1]][0], None]
    y_edges += [pos[edge[0]][1], pos[edge[1]][1], None]

palette = ["#12258f", "#9084c0", "#f1f1f1", "#f09fa2", "#de425b"]
colors = {
    "author_low":  palette[0],
    "author_mid":  palette[1],
    "author_high": palette[3],
    "subreddit":   palette[4],
}

node_colors, node_sizes = [], []
author_categories = {"≤3": 0, "4–5": 0, ">5": 0}

for n in G.nodes():
    degree = len(list(G.neighbors(n)))
    if G.nodes[n]["type"] == "subreddit":
        node_colors.append(colors["subreddit"])
        node_sizes.append(np.log1p(degree) * 14)
    else:
        if degree > 5:
            node_colors.append(colors["author_high"])
            author_categories[">5"] += 1
        elif degree > 3:
            node_colors.append(colors["author_mid"])
            author_categories["4–5"] += 1
        else:
            node_colors.append(colors["author_low"])
            author_categories["≤3"] += 1
        node_sizes.append(np.log1p(degree) * 9)

total_authors = sum(author_categories.values())
author_percentages = {
    group: round((count / total_authors) * 100, 1)
    for group, count in author_categories.items()
}

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x_edges, y=y_edges,
    mode="lines",
    line=dict(width=0.7, color="rgba(0,0,0,0.25)"),
    hoverinfo="none",
    showlegend=False
))

fig.add_trace(go.Scatter(
    x=[pos[n][0] for n in G.nodes()],
    y=[pos[n][1] for n in G.nodes()],
    mode="markers",
    marker=dict(
        size=node_sizes,
        color=node_colors,
        opacity=0.9,
        line=dict(width=0.8, color="white")
    ),
    hovertext=[
        f"<b>{n}</b><br>Type: {G.nodes[n]['type']}<br>Connections: {len(list(G.neighbors(n)))}"
        for n in G.nodes()
    ],
    hoverinfo="text",
    showlegend=False
))


fig.add_trace(go.Scatter(
    x=[pos[n][0] for n in subreddits],
    y=[pos[n][1] for n in subreddits],
    mode="text",
    text=subreddits,
    textposition="top center",
    textfont=dict(size=14, color="rgba(0,0,0,0.85)"),
    hoverinfo="none",
    showlegend=False
))

fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color=colors["author_low"]),
    name=f'Author ≤3 subs ({author_percentages["≤3"]}%)'))
fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color=colors["author_mid"]),
    name=f'Author 4–5 subs ({author_percentages["4–5"]}%)'))
fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color=colors["author_high"]),
    name=f'Author >5 subs ({author_percentages[">5"]}%)'))
fig.add_trace(go.Scatter(x=[None], y=[None], mode='markers',
    marker=dict(size=10, color=colors["subreddit"]),
    name="Subreddit"))

fig.update_layout(
    title="Network of Authors and Subreddits: Cross-Posting Patterns",
    title_font_size=18,
    plot_bgcolor="white",
    showlegend=True,
    legend=dict(
        orientation="h",
        yanchor="bottom", y=-0.15,
        xanchor="center", x=0.5,
        font=dict(size=12)
    ),
    xaxis=dict(showgrid=False, zeroline=False, visible=False),
    yaxis=dict(showgrid=False, zeroline=False, visible=False),
    margin=dict(l=40, r=40, t=70, b=50)
)

fig.show()
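
The network view can be backed by a simple tabulation of how many distinct subreddits the most active authors use. A minimal sketch on made-up data; the cutoff of two top authors is chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical records: author "b" stays in one subreddit, "a" and "c" spread out
posts = pd.DataFrame({
    "author": ["a"] * 3 + ["b"] * 5 + ["c"] * 4,
    "subreddit": ["s1", "s2", "s3"] + ["s1"] * 5 + ["s2", "s2", "s2", "s1"],
})

per_author = posts.groupby("author").agg(
    total_posts=("subreddit", "size"),
    unique_subs=("subreddit", "nunique"),
)
top = per_author.nlargest(2, "total_posts")  # the two most active authors
single_community = (top["unique_subs"] == 1).mean()  # share focused on one subreddit
print(top)
print(f"Share posting in a single subreddit: {single_community:.0%}")
```

On the real data, `posts` would be `df` and the cutoff could match the 300 most active authors used for the graph above.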


#### **Q11** How does the average score (karma/upvotes) change over time?
Tracking how post scores evolve provides insight into shifting community attitudes and levels of endorsement toward conspiracy-related content throughout the pandemic year.

In [None]:
df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df.dropna(subset=["created", "score"])

df["month"] = df["created"].dt.to_period("M").dt.to_timestamp()

avg_scores = (
    df.groupby("month")["score"]
    .mean()
    .reset_index()
    .sort_values("month")
)

In [None]:
colors = ["#0c2688", "#8d84bd", "#d43d51", "#26A57F", "#ec9c9d", "#e3b23c"]


fig = px.area(
    avg_scores,
    x="month",
    y="score",
    color_discrete_sequence=[colors[0]],
    title="Average Post/Comment Score Over Time (Area Chart)",
    labels={
        "month": "Month (2020)",
        "score": "Average Score (Karma/Upvotes)"
    }
)

fig.update_traces(
    line=dict(width=3, color=colors[0]),
    fill='tozeroy',
    opacity=0.5
)

fig.update_layout(
    plot_bgcolor="white",
    title_font_size=18,
    xaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        zeroline=False
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="rgba(0,0,0,0.1)",
        zeroline=False
    ),
    font=dict(family="Open Sans", size=13, color="#333"),
    margin=dict(l=60, r=40, t=80, b=60)
)

fig.show()


### **Sentiment Analysis**

#### **Q12** What is the overall sentiment distribution of all comments and submissions?
Sentiment analysis reveals the emotional atmosphere of conspiracy discussions, indicating whether fear, anger, or hope dominated online exchanges in 2020.

In [None]:
df_sent = pd.read_csv("sentiment.csv")

emotion_cols = [col for col in df_sent.columns if col in [
    "anger", "anticipation", "disgust", "fear", "joy",
    "sadness", "surprise", "trust"
]]

emotion_summary = df_sent[emotion_cols].sum().sort_values(ascending=False).reset_index()
emotion_summary.columns = ["Emotion", "Count"]
emotion_summary["Percent"] = emotion_summary["Count"] / emotion_summary["Count"].sum() * 100

In [None]:
colors = ["#9b5bbb", "#36553C", "#d4b544", "#bcd377", "#4436ad", "#5DB2B1", "#8e2424", "#127950"]

fig = px.bar(
    emotion_summary,
    x="Emotion",
    y="Percent",
    color="Emotion",
    color_discrete_sequence=colors,
    title="Overall Emotion Distribution in Reddit Dataset"
)

fig.update_layout(
    xaxis_title="Emotion",
    yaxis_title="Percentage of Emotion Words",
    plot_bgcolor="white",
    title_font_size=18,
    showlegend=False,
    xaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1),
    yaxis=dict(showgrid=True, gridcolor="rgba(0,0,0,0.1)", gridwidth=1)
)

fig.show()


The overall sentiment distribution of Reddit discussions about conspiracy narratives in 2020 shows a strong emotional polarization. 

Negative sentiment dominates the conversations, slightly surpassing positive sentiment, while neutral posts form a much smaller portion. This indicates that conspiracy-related discussions were highly emotionally charged, with fear, distrust, and anger being particularly widespread. 

The sizable share of positive sentiment suggests the presence of hopeful or supportive narratives within these communities as well, but the prevalence of negativity reflects heightened anxiety and conflict during this period of global uncertainty.

#### **Q13** What are the most positive and most negative subreddits overall?
Comparing sentiment across communities makes it possible to identify where discourse tended to be more constructive, aggressive, or despairing, illustrating emotional diversity within the conspiracy sphere.
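
One way to rank communities is a net sentiment score per subreddit. The sketch below is an assumption about how this could be done, not the project's implementation: it presumes a per-record table with `subreddit`, `positive`, and `negative` word counts, which are hypothetical column names, and uses invented numbers.

```python
import pandas as pd

# Hypothetical per-record sentiment table; the column names are assumptions
sent_demo = pd.DataFrame({
    "subreddit": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "positive": [3, 1, 0, 1, 2, 2],
    "negative": [1, 1, 4, 3, 2, 2],
})

by_sub = sent_demo.groupby("subreddit")[["positive", "negative"]].sum()
# Net sentiment in [-1, 1]: +1 means only positive words, -1 only negative ones
by_sub["net"] = (by_sub["positive"] - by_sub["negative"]) / (
    by_sub["positive"] + by_sub["negative"]
)
ranked = by_sub.sort_values("net", ascending=False)
print(ranked)
```

Subreddits with very few records should probably be filtered out before ranking, since a handful of posts can produce extreme scores.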

#### **Q14** Are longer comments more emotionally expressive (stronger positive/negative values)? 
Exploring the relationship between comment length and emotional intensity can show whether detailed engagement corresponds to stronger affective expression.
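
A sketch of how this relationship could be measured, under two assumptions: emotional intensity is taken as emotion words per word (raw counts would grow with length by construction), and the `positive` and `negative` columns are hypothetical. The Spearman coefficient is computed as the Pearson correlation of ranks, so no extra library is needed. The toy values say nothing about the real data.

```python
import pandas as pd

length_demo = pd.DataFrame({
    "body": [
        "ok",
        "this is fine",
        "I am really very scared and angry about all of this",
        "no",
    ],
    "positive": [1, 1, 0, 0],
    "negative": [0, 0, 3, 1],
})

length_demo["length"] = length_demo["body"].str.split().str.len()  # length in words
# Emotion words of either polarity, normalized by length
length_demo["intensity"] = (
    length_demo["positive"] + length_demo["negative"]
) / length_demo["length"]

# Spearman rho = Pearson correlation of the ranks
rho = length_demo["length"].rank().corr(length_demo["intensity"].rank())
print(f"Spearman rho between length and intensity: {rho:.2f}")
```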

#### **Q15** How does sentiment vary between submissions and comments? 
Contrasting the tone of original posts with that of replies helps determine whether discussions amplify, neutralize, or challenge the initial emotional framing.
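
A possible starting point is a grouped summary of net sentiment by record type. The sketch assumes hypothetical `positive` and `negative` count columns alongside `is_submission`; the values are invented.

```python
import pandas as pd

type_demo = pd.DataFrame({
    "is_submission": [True, True, False, False, False],
    "positive": [2, 0, 1, 0, 0],
    "negative": [1, 1, 2, 3, 1],
})
type_demo["net"] = type_demo["positive"] - type_demo["negative"]

by_type = (
    type_demo.groupby("is_submission")["net"]
    .agg(["mean", "median", "count"])
    .rename(index={True: "Submission", False: "Comment"})
)
print(by_type)
```

Reporting the count next to the mean and median matters here because comments are likely to outnumber submissions by a wide margin.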

#### **Q16** Do posts with more positive sentiment tend to get higher scores? 
Assessing this relationship clarifies which emotional tones receive greater validation from the community, shedding light on collective preferences for positivity or outrage.
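
Because Reddit scores tend to be heavily skewed, comparing median scores across tone categories may be more robust than a linear correlation. A minimal sketch with invented numbers; the `net` column (positive minus negative word counts) is an assumption.

```python
import numpy as np
import pandas as pd

score_demo = pd.DataFrame({
    "net": [-3, -2, -1, 0, 0, 1, 2, 3],
    "score": [12, 8, 5, 2, 3, 4, 6, 9],
})

# Bucket each record by the sign of its net sentiment
score_demo["tone"] = np.select(
    [score_demo["net"] < 0, score_demo["net"] > 0],
    ["negative", "positive"],
    default="neutral",
)
median_score = score_demo.groupby("tone")["score"].median()
print(median_score)
```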

### **Interesting Findings**

#### **Q17** Can recurring narratives or metaphors be identified? 
Recognizing repeated metaphors and storylines allows us to understand how conspiracy narratives are constructed symbolically, often relying on themes of awakening, deception, or hidden power.

In [None]:
!pip install python-louvain

In [None]:
import matplotlib.pyplot as plt
import re
import networkx as nx
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import community.community_louvain as community_louvain

# download the English stopword list (no-op if it is already cached)
nltk.download('stopwords')
stop = set(nltk.corpus.stopwords.words("english"))

# load only the needed columns, faster
df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\dataset.csv", usecols=["body"])

# custom conspiracy vocabulary
conspiracy_terms = [
    "deep state","truth","wake up","they control","hidden","elite",
    "illuminati","nwo","new world order","great awakening",
    "plandemic","sheeple","mind control","5g","fake news",
    "qanon","cabal","puppet","big pharma","globalists",
]

# keep only rows containing conspiratorial themes
pattern = r"|".join([re.escape(x) for x in conspiracy_terms])
df = df[df["body"].str.contains(pattern, case=False, na=False)]

# cleaning function
def clean(t):
    t = str(t).lower()
    t = re.sub(r"http\S+|www\S+", " ", t)
    t = re.sub(r"[^a-z\s]", " ", t)
    words = [w for w in t.split() if w not in stop and len(w) > 2]
    return " ".join(words)

df["clean"] = df["body"].apply(clean)

# bigram extractor
vectorizer = CountVectorizer(ngram_range=(2,2), min_df=15)
X = vectorizer.fit_transform(df["clean"])

counts = X.sum(axis=0).A1
bigrams = vectorizer.get_feature_names_out()
bigram_freq = dict(zip(bigrams, counts))

# sort by conspiracy relevance (keep only phrases that contain keywords)
filtered_bigrams = {
    bg:freq for bg,freq in bigram_freq.items()
    if any(key in bg for key in ["state","truth","wake","control","elite","order","5g","qanon","virus","plandemic"])
}

top_bigrams = Counter(filtered_bigrams).most_common(25)

print("Top conspiracy bigrams:\n")
for phrase, freq in top_bigrams:
    print(f"{phrase:35s} {freq}")

In [None]:
import plotly.graph_objects as go
import networkx as nx

# --- build conspiracy graph from bigrams ---
G = nx.Graph()

for phrase, freq in top_bigrams:
    w1, w2 = phrase.split()
    G.add_edge(w1, w2, weight=freq)

# --- Layout ---
pos = nx.spring_layout(G, k=20, iterations=200, seed=42)

# --- Edges ---
edge_x, edge_y = [], []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=1.5, color='rgba(160,160,160,0.4)'),
    hoverinfo='none',
    mode='lines'
)

# --- Nodes ---
node_x, node_y = [], []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)

node_adjacencies = []
node_text = []

for node, adjacencies in G.adjacency():
    deg = len(adjacencies)
    node_adjacencies.append(deg)
    node_text.append(f"{node}<br>Connections: {deg}")

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers+text',
    hoverinfo='text',
    hovertext=node_text,  # show "<word> / Connections: N" on hover
    text=[n for n in G.nodes()],
    textposition="top center",
    textfont=dict(size=11, color="black"),
    marker=dict(
        showscale=True,
        colorscale='Bluered',
        color=node_adjacencies,
        size=[12 + d * 3 for d in node_adjacencies],
        line=dict(width=1, color='white'),
        colorbar=dict(
            thickness=12,
            title='Connections',
            xanchor='left'
        )
    )
)

# --- Final figure ---
fig = go.Figure(
    data=[edge_trace, node_trace],
    layout=go.Layout(
        title=dict(
            text="<b>Conspiracy Narratives: Network of Recurring Bigrams (Reddit 2020)</b>",
            font=dict(size=20),  # replaces the deprecated titlefont_size argument
        ),
        showlegend=False,
        hovermode='closest',
        margin=dict(b=20, l=20, r=20, t=60),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        plot_bgcolor="white"
    )
)

fig.update_traces(textposition="top center")  
fig.show()


Based on the 2020 Reddit data, the network graph shows that the concept of “truth” serves as the most central and highly connected hub within these conspiracy narratives.

It functions as a core anchor that links directly to ideas related to seeking, knowing, and revealing, such as “find,” “know,” “tell,” “see,” and “want.”

The visualization also highlights several distinct thematic clusters, including:
- Global Control: “great elite,” “mind control,” “global virus,” “deep state”
- New World Order: “order submission,” “world”
- Call to Action: “wake,” “united,” “people,” “speak”

Despite their differences, all these clusters are tied back to the central pursuit of truth, suggesting that it acts as the unifying concept across a wide range of conspiracy themes.
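
The hub claim can be checked numerically by counting how many distinct neighbours each word has in the bigram graph, which is the same quantity the plot encodes as node size and colour. The following is a minimal sketch in plain Python. The `toy_bigrams` list is hypothetical and only stands in for the notebook's `top_bigrams`.

```python
from collections import defaultdict

def bigram_degrees(bigrams):
    """Return {word: number of distinct neighbours} for an undirected bigram graph.

    `bigrams` is an iterable of (phrase, frequency) pairs where each phrase
    consists of exactly two space-separated words.
    """
    neighbours = defaultdict(set)
    for phrase, _freq in bigrams:
        w1, w2 = phrase.split()
        neighbours[w1].add(w2)
        neighbours[w2].add(w1)
    return {word: len(linked) for word, linked in neighbours.items()}

# Hypothetical (phrase, frequency) pairs, not real results
toy_bigrams = [
    ("know truth", 120),
    ("find truth", 95),
    ("tell truth", 80),
    ("truth know", 30),   # reversed order maps to the same undirected edge
    ("deep state", 200),
    ("mind control", 150),
]

degrees = bigram_degrees(toy_bigrams)
# Sort by degree (descending), then alphabetically for a stable order
ranked = sorted(degrees.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked[:3])
```

Degree ignores edge weights, so a word can be a hub through many rare pairings. Summing the bigram frequencies per word would give a weighted alternative.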

#### **Q18** Which grammatical constructions are most common (imperative, interrogative, emotional)? 
Analyzing sentence structure helps reveal the rhetorical strategies users rely on to foster belief or participation: whether they try to command, question, or emotionally appeal to others.
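
As a rough starting point, sentence types can be approximated with punctuation and first-word heuristics. The sketch below is only an illustration under strong assumptions: the `IMPERATIVE_STARTERS` list is hypothetical, and a real analysis would more likely rely on a part-of-speech tagger.

```python
import re
from collections import Counter

# Hypothetical list of verbs that often open a command; not derived from the data.
IMPERATIVE_STARTERS = {"wake", "do", "look", "stop", "read", "think", "open", "share"}

def sentence_type(sentence):
    """Classify one sentence with simple surface heuristics."""
    s = sentence.strip()
    words = re.findall(r"[a-z']+", s.lower())
    if s.endswith("?"):
        return "interrogative"
    if words and words[0] in IMPERATIVE_STARTERS:
        return "imperative"
    if s.endswith("!"):
        return "emotional"
    return "declarative"

def count_sentence_types(text):
    """Split text on ., ! and ? and count the type of each sentence.

    Trailing text without closing punctuation is ignored by this splitter.
    """
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    return Counter(sentence_type(s) for s in sentences)

toy_text = "Wake up people! Who controls the media? Do your own research. They lie."
counts = count_sentence_types(toy_text)
print(counts)
```

The order of the checks matters: a question mark wins over a verb-initial opening, so "Do you know?" is counted as a question rather than a command.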

#### **Q19** Has skepticism toward official statistics and mainstream media increased during the pandemic?
Measuring changes in expressions of distrust provides evidence for whether COVID-19 intensified anti-establishment attitudes within conspiracy communities.
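
Keyword counts of this kind depend on how a match is defined. A plain substring search for "lies" also hits "families" and "applies", which inflates the count, while a word-boundary pattern does not. The sketch below illustrates the difference on a made-up sentence; the helper names are illustrative only.

```python
import re

def count_substring(text, keyword):
    """Count every occurrence of keyword, including inside longer words."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))

def count_whole_word(text, keyword):
    """Count occurrences of keyword delimited by word boundaries."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

toy_text = "Families hear lies every day; the press applies more Lies."

print(count_substring(toy_text, "lies"))   # matches inside "Families" and "applies" too
print(count_whole_word(toy_text, "lies"))  # matches only the standalone word
```

The same boundary logic works for multi-word phrases such as "fake news", which then no longer matches "fake newsletters".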

In [None]:
import pandas as pd
import re
import plotly.graph_objects as go

# --- Load dataset ---
df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\dataset.csv", low_memory=False)

# --- Keywords related to skepticism ---
keywords = ["fake news", "lies", "propaganda", "mainstream media"]
pattern = re.compile(r'\b(?:' + '|'.join([re.escape(k) for k in keywords]) + r')\b', re.IGNORECASE)

# --- Ensure 'created' column exists and convert to datetime ---
if 'created' in df.columns:
    df['date'] = pd.to_datetime(df['created'], errors='coerce')
elif 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
else:
    raise KeyError("DataFrame must have a date column ('created' or 'date').")

# --- Create 'month' column ---
df['month'] = df['date'].dt.to_period('M')

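# --- Create separate columns for each keyword ---
# Count whole-word matches only, so that "lies" does not also match "families" or "applies"
for kw in keywords:
    df[kw] = df['body'].astype(str).str.count(r'\b' + re.escape(kw) + r'\b', flags=re.IGNORECASE)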

# --- Sum per month for each keyword ---
monthly_keywords = df.groupby('month')[keywords].sum().reset_index()
monthly_keywords['month'] = monthly_keywords['month'].dt.to_timestamp()

# --- Ensure numeric data ---
for kw in keywords:
    monthly_keywords[kw] = pd.to_numeric(monthly_keywords[kw], errors='coerce').fillna(0).astype(int)

In [None]:
# --- Define color palette ---
colors = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]

# --- Create filled area chart with Plotly ---
fig = go.Figure()

for i, kw in enumerate(keywords):
    fig.add_trace(go.Scatter(
        x=monthly_keywords['month'],
        y=monthly_keywords[kw],
        name=kw,
        mode='lines',
        line=dict(width=0.5, color=colors[i]),
        stackgroup='one',  # enables stacked area
        groupnorm='',
        fillcolor=colors[i]
    ))

# --- Customize layout ---
fig.update_layout(
    title='Skepticism toward Official Statistics and Mainstream Media in 2020',
    xaxis_title='Month',
    yaxis_title='Number of Mentions',
    template='plotly_white',
    legend_title_text='Keywords',
    font=dict(size=13),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=True, zeroline=False)
)

# --- Show interactive chart ---
fig.show()


The data indicate that skepticism toward official statistics and mainstream media experienced noticeable fluctuations throughout 2020, closely corresponding with key phases of the COVID-19 pandemic. In the early months of the year, public discussion around misinformation and distrust remained relatively stable. However, beginning in March 2020, when the pandemic was officially declared, there was a clear increase in the number of mentions of terms such as “fake news,” “propaganda,” “lies,” and “mainstream media.” This suggests that the health crisis and the accompanying information overload amplified public doubts about the credibility of institutional communication and media reporting.

While this surge slightly declined in mid-2020, skepticism remained consistently higher than pre-pandemic levels, with another peak visible toward the end of the year. This secondary rise likely reflects renewed tensions around political polarization, vaccine announcements, and continued debates over misinformation.

Overall, the evidence supports the conclusion that the COVID-19 pandemic acted as a catalyst for growing distrust in traditional information sources. The sustained visibility of skeptical discourse throughout 2020 illustrates how global crises can intensify public questioning of authority, expertise, and the reliability of official narratives.
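
Because the chart shows raw mention counts, a rise could also reflect more posting overall rather than a larger share of skeptical language. One way to separate the two is to express mentions per 1,000 texts in each month. The following is a minimal sketch in plain Python. The `(month, text)` records are made up and the function name is illustrative.

```python
import re
from collections import Counter

def monthly_mention_rate(records, keyword):
    """Return {month: keyword mentions per 1,000 texts}.

    `records` is an iterable of (month_label, text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    mentions, totals = Counter(), Counter()
    for month, text in records:
        totals[month] += 1
        mentions[month] += len(pattern.findall(text))
    return {month: 1000 * mentions[month] / totals[month] for month in sorted(totals)}

# Hypothetical records, not taken from the dataset
toy_records = [
    ("2020-02", "nothing here"),
    ("2020-02", "pure propaganda"),
    ("2020-03", "propaganda again"),
    ("2020-03", "more propaganda, propaganda everywhere"),
    ("2020-03", "unrelated comment"),
    ("2020-03", "another unrelated comment"),
]

rates = monthly_mention_rate(toy_records, "propaganda")
print(rates)
```

In this toy example the raw count triples from February to March, while the rate per 1,000 texts grows by only half, because the number of texts doubled as well.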

#### **Q20** Does the language of users who discuss COVID differ from that of those discussing other conspiracy topics?
Comparing linguistic patterns highlights how the pandemic introduced new vocabularies (medical, scientific, or apocalyptic) and reshaped discourse styles.

In [None]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# --- 0. Data cleaning ---
# replace NaN and numeric values with strings to avoid a TypeError
df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\dataset.csv", low_memory=False)
df['body'] = df['body'].fillna('').astype(str)

# --- 1. Identify COVID-related posts ---
covid_keywords = ["covid", "coronavirus", "pandemic", "vaccine", "vaccines", "lockdown", "virus", "pfizer"]
pattern = r'\b(' + '|'.join(covid_keywords) + r')\b'

df['category'] = df['body'].str.lower().apply(
    lambda x: 'covid' if re.search(pattern, x) else 'non-covid'
)

# --- 2. Text cleaning ---
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text

df['clean'] = df['body'].apply(clean_text)

#--- 3. Split into two groups ---
covid_texts = df[df['category'] == 'covid']['clean']
other_texts = df[df['category'] == 'non-covid']['clean']

# --- 4. Count word frequencies ---
vectorizer = CountVectorizer(max_features=20, stop_words='english')
covid_counts = np.sum(vectorizer.fit_transform(covid_texts).toarray(), axis=0)
covid_vocab = vectorizer.get_feature_names_out()

vectorizer = CountVectorizer(vocabulary=covid_vocab, stop_words='english')
other_counts = np.sum(vectorizer.fit_transform(other_texts).toarray(), axis=0)

# --- 5. Prepare DataFrame for heatmap ---
heatmap_df = pd.DataFrame({
    'COVID': covid_counts,
    'Non-COVID': other_counts
}, index=covid_vocab)

# normalize each column to relative frequencies (%) so the two groups are comparable
heatmap_df = heatmap_df.div(heatmap_df.sum(axis=0), axis=1) * 100

In [None]:
import plotly.graph_objects as go

# --- Build interactive heatmap with Plotly ---
fig = go.Figure(data=go.Heatmap(
    z=heatmap_df.values,                # matrix of values
    x=heatmap_df.columns,               # category labels
    y=heatmap_df.index,                 # word labels
    colorscale='YlGnBu',                # same color palette as Seaborn
    text=heatmap_df.round(1),           # annotations with 1 decimal
    texttemplate="%{text}",             # show annotations
    colorbar=dict(title='Frequency (%)')  # colorbar title
))

# --- Layout customization ---
fig.update_layout(
    title="Word Frequency Comparison: COVID vs Non-COVID Conspiracies",
    xaxis_title="Category",
    yaxis_title="Words",
    template='plotly_white',
    font=dict(size=12),
    width=900,
    height=600
)

# --- Show interactive chart ---
fig.show()


The comparison of word frequencies between COVID-related and non-COVID conspiracy discussions reveals distinct linguistic patterns that reflect the specific focus and context of each topic. In COVID-related conspiracies, users more frequently use terms such as “virus,” “vaccine,” “vaccines,” and “covid,” which directly reference the health crisis and associated medical themes. Part of this gap is expected by construction, since these same terms are among the keywords used to assign texts to the COVID category; nonetheless, it indicates a discourse centered on public health, disease transmission, and skepticism toward scientific or governmental handling of the pandemic.

Conversely, non-COVID conspiracy discussions are dominated by more general or socially oriented language, with higher frequencies for words such as “people,” “like,” “just,” and “think.” These words suggest broader discussions about human behavior, belief systems, and opinion-sharing rather than specific scientific or epidemiological issues.

Overall, the linguistic divergence suggests that while COVID-related conspiracies are grounded in biomedical and institutional skepticism, non-COVID conspiracies tend to emphasize social dynamics, perception, and general distrust. This difference highlights how global crises like the pandemic reshape not only the content but also the linguistic framing of conspiratorial discourse.
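
Raw or relative frequencies mostly surface common words. To highlight words that are distinctive for one group, a smoothed log-ratio of relative frequencies is a common alternative. The sketch below is a plain-Python illustration on made-up token lists. The add-one smoothing and the function name are assumptions, not part of the notebook.

```python
import math
from collections import Counter

def log2_ratio(tokens_a, tokens_b, smoothing=1.0):
    """Smoothed log2 ratio of relative word frequencies (group A vs group B).

    Positive scores mark words over-represented in A, negative scores in B.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    # Additive smoothing keeps words unseen in one group from producing log(0)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        w: math.log2(
            ((counts_a[w] + smoothing) / total_a) / ((counts_b[w] + smoothing) / total_b)
        )
        for w in vocab
    }

# Hypothetical token lists standing in for the COVID and non-COVID groups
covid_tokens = "virus vaccine people virus".split()
other_tokens = "people think people just".split()

scores = log2_ratio(covid_tokens, other_tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:+.2f}")
```

A score of +1 means the word is twice as frequent in the first group after smoothing, and -1 means it is half as frequent.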

#### **Q21** Did global events of 2020 (the pandemic, protests, elections, and vaccination) influence the rise or decline of religious rhetoric in conspiracy theories?
This research helps us understand how global crises and social upheavals shape the language and themes of conspiracy narratives. By analyzing the fluctuations in religious references, we can see how faith-based explanations emerge as coping mechanisms during uncertainty and how religion becomes intertwined with misinformation or ideological polarization.

In [None]:
import pandas as pd
import plotly.graph_objects as go

# --- Load data ---
df = pd.read_csv(r"C:\Users\User\Desktop\uni\CSS\project\dataset.csv", low_memory=False)

# --- Keywords ---
religious_keywords = [
    "god", "jesus", "christ", "bible", "faith", "church", "pray", 
    "religion", "satan", "prophecy", "heaven", "hell", "angel", "devil"
]

# --- Preprocessing ---
df["body"] = df["body"].astype(str).str.lower()
df["created"] = pd.to_datetime(df["created"], errors="coerce")
df = df.dropna(subset=["created"])
df = df[df["created"].dt.year == 2020]

# --- Flag religious texts ---
# Match whole words only, so that "hell" does not match "hello" and "pray" does not match "spray"
# (derived forms such as "prayers" are then not counted unless added to the keyword list)
pattern = r"\b(?:" + "|".join(religious_keywords) + r")\b"
df["is_religious"] = df["body"].str.contains(pattern, case=False, regex=True)

# --- Aggregate by week ---
df["week"] = df["created"].dt.to_period("W").dt.start_time
weekly = df.groupby("week")["is_religious"].mean().reset_index()
weekly.columns = ["week", "religious_ratio"]



In [None]:
# --- Key events ---
events = {
    "2020-02-17": "Christian feast day",
    "2020-03-11": "COVID-19 pandemic",
    "2020-04-13": "Easter Monday",
    "2020-05-25": "George Floyd protests",
    "2020-07-06": "Birthday of the 14th Dalai Lama",
    "2020-08-04": "Beirut explosion",
    "2020-09-28": "Yom Kippur",
    "2020-11-03": "US presidential election",
    "2020-12-14": "First COVID-19 vaccines",
}

# --- Interactive chart ---
fig = go.Figure()

# --- Timeline of religious mentions ---
fig.add_trace(go.Scatter(
    x=weekly["week"],
    y=weekly["religious_ratio"],
    mode="lines+markers",
    name="Religious mentions",
    line=dict(color="#0d47a1", width=3),
    marker=dict(size=6, color="#1976d2", line=dict(width=1, color="#0a2a6b")),
    hovertemplate="<b>%{x|%b %d, %Y}</b><br>Share: %{y:.3f}<extra></extra>"
))

# --- Vertical lines + annotations (labels placed above or below the plot) ---
for i, (date_str, label) in enumerate(events.items()):
    date = pd.to_datetime(date_str).to_pydatetime()

    # Event line
    fig.add_shape(
        type="line",
        x0=date, x1=date,
        y0=0, y1=1,
        xref="x", yref="paper",
        line=dict(color="rgba(0,0,80,0.25)", width=2, dash="dot")
    )

    # Place selected labels below the plot so that neighbouring labels do not overlap
    bottom_labels = {"2020-05-25", "2020-11-03", "2020-07-06", "2020-12-14"}
    y_pos = -0.05 if date_str in bottom_labels else 1.02
    angle = 25  # same tilt for top and bottom labels

    # Add the label
    fig.add_annotation(
        x=date,
        y=y_pos,
        xref="x",
        yref="paper",
        showarrow=False,
        text=label,
        font=dict(size=10, color="black"),
        textangle=angle
    )


# --- Styling ---
fig.update_layout(
    title={
        "text": "Religious References in Conspiracy Discussions (2020)<br><sup>with Key Global Events</sup>",
        "x": 0.5,
        "xanchor": "center",
        "font": dict(size=18)
    },
    xaxis_title="Week of 2020",
    yaxis_title="Share of religious texts",
    template="plotly_white",
    hovermode="x unified",
    plot_bgcolor="#f9fbff",
    paper_bgcolor="#f9fbff",
    font=dict(color="#0a2a6b"),
)

# --- Limit the X axis to 2020 ---
fig.update_xaxes(range=["2020-01-01", "2020-12-31"])

# --- Save ---
fig.write_html("religious_trends_2020.html")
fig.show()


The data presented in the graph demonstrate that the presence of religious themes within conspiracy theory discussions fluctuated in close relation to major global events throughout 2020. Notably, during periods of intense socio-political and public health crises, such as the official declaration of the COVID-19 pandemic or the U.S. presidential election, the share of religious references declined significantly. This suggests that, in moments of acute global tension, conspiracy narratives tended to focus more on political, scientific, or institutional explanations rather than invoking religious frameworks.

In contrast, during religious holidays and spiritually symbolic dates, such as Easter Monday or the birthday of the 14th Dalai Lama, there was a clear rise in the frequency of religious rhetoric. These peaks imply that collective religious observances may serve as catalysts for renewed interest in theological or eschatological interpretations within conspiratorial discourse.

Overall, the findings indicate that religious references in conspiracy discussions are not constant but context-dependent: they tend to diminish during periods dominated by secular crises and resurface during times of religious or spiritual significance. This pattern highlights the dynamic interaction between religion, collective emotion, and the search for meaning in the face of uncertainty.
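
Readings like these can be made more concrete by comparing the weekly share just before an event with the share from the event week onward. The sketch below assumes weeks are keyed by their Monday start date, in line with the weekly aggregation used for the chart. The `toy_share` values are made up and the function name is illustrative.

```python
from datetime import date, timedelta

def mean_share(weekly_share, weeks):
    """Average share over the given week-start dates."""
    return sum(weekly_share[w] for w in weeks) / len(weeks)

def change_around_event(weekly_share, event_day, window=2):
    """Mean share in the `window` weeks starting with the event week,
    minus the mean share in the `window` weeks before it.

    Returns None when any of the required weeks is missing.
    """
    week_start = event_day - timedelta(days=event_day.weekday())  # Monday of the event week
    before = [week_start - timedelta(weeks=i) for i in range(1, window + 1)]
    after = [week_start + timedelta(weeks=i) for i in range(window)]
    if any(w not in weekly_share for w in before + after):
        return None
    return mean_share(weekly_share, after) - mean_share(weekly_share, before)

# Hypothetical weekly shares keyed by the Monday of each week
toy_share = {
    date(2020, 2, 24): 0.10,
    date(2020, 3, 2): 0.12,
    date(2020, 3, 9): 0.08,
    date(2020, 3, 16): 0.06,
}

delta = change_around_event(toy_share, date(2020, 3, 11))
print(f"Change in religious share around the event: {delta:+.3f}")
```

A negative value indicates a drop after the event. With only a handful of weeks per window, such differences are descriptive and do not by themselves establish that the event caused the change.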

#### **Q22** Which keywords most strongly co-occur with “COVID” or “virus”?
Identifying keyword co-occurrences uncovers how different ideas, such as “5G,” “vaccine,” “control,” or “Bill Gates,” clustered around the concept of the virus, revealing the structure of pandemic-related conspiracies.
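
A simple way to approximate this is document-level co-occurrence: for every text that mentions a target word, count which other words appear in the same text. The sketch below is a minimal plain-Python illustration. The `toy_texts` are made up, and no stopword filtering is applied, which a real analysis would likely add.

```python
import re
from collections import Counter

def cooccurring_terms(texts, targets=("covid", "virus"), min_len=2):
    """Count in how many target-mentioning texts each other word appears."""
    target_set = set(targets)
    counts = Counter()
    for text in texts:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))  # unique words per text
        if tokens & target_set:
            counts.update(t for t in tokens - target_set if len(t) >= min_len)
    return counts

# Hypothetical texts, not taken from the dataset
toy_texts = [
    "The virus is spread by 5G towers",
    "Bill Gates planned the virus",
    "5G and covid vaccine control",
    "Nothing to see here about 5G",
]

cooc = cooccurring_terms(toy_texts)
print(cooc.most_common(5))
```

Each word is counted at most once per text, so the result is the number of target-mentioning texts containing the word rather than the total number of occurrences.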