# 🎯 Objective

This notebook explores **political polarization in Reddit discussions about Trump**.  
We analyze **comment sentiment** and **comment length** to understand differences between left-leaning, right-leaning, and neutral subreddits.

---

## 📚 Libraries

In [33]:
```python
# 📚 Import Required Libraries
import os  # File path handling
import sqlite3  # Database connection
import pandas as pd  # Data analysis
import numpy as np  # Numerical computations
from lets_plot import *  # Data visualization

LetsPlot.setup_html()  # Enable lets-plot for Jupyter
```


In [34]:
# 📥 Load Data from SQLite
DB_PATH = "/files/ds105a-2024-alternative-summative-ajchan03/data/reddit_data.db"
conn = sqlite3.connect(DB_PATH)

# Load posts & comments data
df_posts = pd.read_sql_query("SELECT * FROM posts;", conn)
df_comments = pd.read_sql_query("SELECT * FROM comments;", conn)

# Close connection
conn.close()

# ✅ Add Comment Length Column (Word Count); a missing body counts as 0 words
df_comments["comment_length"] = df_comments["body"].fillna("").str.split().str.len()

# ✅ Categorize Sentiment (right-closed bins; include_lowest keeps a score of exactly -1)
df_comments["sentiment_category"] = pd.cut(
    df_comments["comment_sentiment"],
    bins=[-1, -0.05, 0.05, 1],
    labels=["Negative", "Neutral", "Positive"],
    include_lowest=True
)

# ✅ Display a Sample of the Comments (no filtering is applied at this point)
print("\n📊 Sample Trump-Related Comments:")
display(df_comments.head())



📊 Sample Trump-Related Comments:


| | post_id | comment_id | body | score | created_utc | subreddit | comment_sentiment | comment_length | sentiment_category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1d4emcb | l6drhd7 | You talking about twice impeached, convicted f... | 1246 | 2024-05-30 21:16:40 | politics | 0.0 | 10 | Neutral |
| 1 | 1d4emcb | l6dycuu | Kinda crazy to think that Donald Trump had com... | 39 | 2024-05-30 21:56:35 | politics | -0.5434 | 19 | Negative |
| 2 | 1d4emcb | l6dwst2 | We can now officially add "convicted felon" to... | 36 | 2024-05-30 21:47:25 | politics | 0.0 | 17 | Neutral |
| 3 | 1d4emcb | l6dueig | Wooo a 34/34, perfect 100%! Highest grade Trum... | 72 | 2024-05-30 21:33:33 | politics | 0.6114 | 13 | Positive |
| 4 | 1d4emcb | l6dvis0 | The first line of his wikipedia page (not by m... | 74 | 2024-05-30 21:39:59 | politics | 0.4215 | 42 | Positive |
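
The two derived columns can be checked on a small, made-up frame before running them on the full table. The sketch below assumes the same bin edges as the cell above; the `toy` frame and its values are hypothetical and only illustrate how edge cases (a missing body, a score of exactly -1, a score of exactly 0.05) are handled.

```python
import pandas as pd

# Hypothetical toy data standing in for df_comments (values are made up).
toy = pd.DataFrame({
    "body": ["Short one", None, "A slightly longer comment here", ""],
    "comment_sentiment": [-1.0, 0.0, 0.05, 0.6114],
})

# Word count that treats a missing or empty body as zero words.
toy["comment_length"] = toy["body"].fillna("").str.split().str.len()

# Same bin edges as the notebook. Bins are right-closed, so 0.05 falls in
# "Neutral"; include_lowest=True keeps a score of exactly -1 in "Negative".
toy["sentiment_category"] = pd.cut(
    toy["comment_sentiment"],
    bins=[-1, -0.05, 0.05, 1],
    labels=["Negative", "Neutral", "Positive"],
    include_lowest=True,
)

print(toy[["comment_length", "sentiment_category"]])
```

If a score of exactly 0.05 should count as positive instead, passing `right=False` to `pd.cut` would be one way to shift the boundaries; whether that matches the sentiment scorer used here is an assumption to verify.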


# 📊 Visualization 1: Sentiment Distribution of Trump-Related Comments (Violin Plot)

In [35]:
# 📊 Sentiment Distribution by Subreddit
print("\n📊 Generating Sentiment Distribution Plot...")
p1 = ggplot(df_comments, aes(x="subreddit", y="comment_sentiment", fill="subreddit")) + \
    geom_violin(alpha=0.6) + \
    geom_boxplot(width=0.2, outlier_alpha=0.3) + \
    ggtitle("🔴 Sentiment Distribution of Trump-Related Comments") + \
    xlab("Subreddit") + ylab("Sentiment Score (-1 to 1)") + \
    theme_minimal()
display(p1)


📊 Generating Sentiment Distribution Plot...
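
A numeric summary can complement the violin plot, since the median and quartiles that the box plot draws are easier to compare as a table. This is a minimal sketch on hypothetical data; the `toy` frame is made up and the numbers are not results from the Reddit dataset.

```python
import pandas as pd

# Hypothetical sentiment scores for two subreddits (made-up values).
toy = pd.DataFrame({
    "subreddit": ["politics"] * 4 + ["Conservative"] * 4,
    "comment_sentiment": [-0.6, -0.2, 0.0, 0.4, -0.8, 0.0, 0.2, 0.6],
})

# Quartiles per subreddit: the same quantities a box plot displays.
quartiles = (
    toy.groupby("subreddit")["comment_sentiment"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range as a simple measure of how spread out sentiment is.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

print(quartiles)
```

Applied to `df_comments`, the same `groupby` would give one row per subreddit, which makes a claim such as "wider range of sentiment" directly checkable.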


# 📊 Visualization 2: Comment Length Distribution (Histogram)

In [48]:
# Keep only the two subreddits being compared
df_filtered_comments = df_comments[df_comments["subreddit"].isin(["politics", "Conservative"])]

# 📊 Comment Length Distribution by Subreddit
print("\n📊 Generating Comment Length Distribution Plot...")
# Parentheses replace the backslash continuations: any character after a
# trailing backslash raises "unexpected character after line continuation character".
p2 = (
    ggplot(df_filtered_comments, aes(x="comment_length", fill="subreddit"))
    + geom_histogram(bins=30, alpha=0.6, position="identity")
    + ggtitle("🔵 Comment Length Distribution")
    + xlab("Comment Length (Words)") + ylab("Frequency")
    + xlim(0, 250)
    + theme_minimal()
)
display(p2)


In [53]:
from lets_plot import *

# Initialize Lets-Plot for Jupyter Notebook
LetsPlot.setup_html()

# Make a copy to avoid SettingWithCopyWarning
df_filtered_comments = df_comments[df_comments["subreddit"].isin(["politics", "Conservative"])].copy()

# Convert 'subreddit' column to categorical type
df_filtered_comments["subreddit"] = df_filtered_comments["subreddit"].astype("category")

# 📊 Improved Comment Length Distribution Plot with Density
print("\n📊 Generating Comment Length Distribution Plot with Density...")
p2 = ggplot(df_filtered_comments, aes(x="comment_length", fill="subreddit", y="..density..")) + \
    geom_histogram(bins=30, alpha=0.6, position="identity") + \
    ggtitle("📊 Normalized Comment Length Distribution by Subreddit") + \
    xlab("Comment Length (Words)") + \
    ylab("Density") + \
    xlim(0, 250) + \
    scale_fill_manual(values={"politics": "#1f77b4", "Conservative": "#d62728"}) + \
    theme_minimal()

display(p2)



📊 Generating Comment Length Distribution Plot with Density...
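
The reason for plotting density rather than raw counts is that the two subreddits may contribute very different numbers of comments. The sketch below illustrates the idea with NumPy, assuming `..density..` follows the usual definition (count divided by group size and bin width); the group sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups of comment lengths with very different sizes.
big = rng.integers(1, 250, size=5000)
small = rng.integers(1, 250, size=300)

# 30 equal-width bins over the same 0-250 range used in the plot.
edges = np.linspace(0, 250, 31)
widths = np.diff(edges)

# Raw counts scale with group size, so the larger group dominates the plot.
counts_big, _ = np.histogram(big, bins=edges)
counts_small, _ = np.histogram(small, bins=edges)

# Densities are normalized per group: each histogram's area sums to 1.
dens_big, _ = np.histogram(big, bins=edges, density=True)
dens_small, _ = np.histogram(small, bins=edges, density=True)

area_big = (dens_big * widths).sum()
area_small = (dens_small * widths).sum()
print(counts_big.sum(), counts_small.sum(), area_big, area_small)
```

Because both areas equal 1, the shapes of the two distributions can be compared directly even when one subreddit has far more comments than the other.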


## 🔍 Key Findings:

1️⃣ **Sentiment Polarization**
   - **r/politics** has a higher proportion of **negative Trump-related comments**.
   - **r/Conservative** shows a wider range of sentiment, including some **positive Trump-related comments**.
   - **r/PoliticalDiscussion** has more **neutral** comments.

2️⃣ **Comment Length Differences**
   - **r/politics** tends to have **longer** comments, measured in words.
   - **r/Conservative** has **shorter but more frequent** comments.
   - **r/PoliticalDiscussion** is in the middle, with a mix of short and long comments.

These findings highlight **how different political communities discuss Trump** on Reddit.
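
The findings above can be backed with two small tables: the share of each sentiment category per subreddit, and the median comment length per subreddit. This is a sketch on hypothetical data; the `toy` frame is made up, and the column names are assumed to match those created earlier in the notebook.

```python
import pandas as pd

# Hypothetical comments (made-up values), using the notebook's column names.
toy = pd.DataFrame({
    "subreddit": ["politics", "politics", "politics", "politics",
                  "Conservative", "Conservative"],
    "sentiment_category": ["Negative", "Negative", "Positive", "Neutral",
                           "Positive", "Negative"],
    "comment_length": [10, 20, 30, 40, 5, 15],
})

# Row-normalized shares: each row sums to 1, so subreddits of different
# size can be compared on the proportion of negative or positive comments.
shares = pd.crosstab(
    toy["subreddit"], toy["sentiment_category"], normalize="index"
)

# Median length is less sensitive to a few very long comments than the mean.
median_len = toy.groupby("subreddit")["comment_length"].median()

print(shares)
print(median_len)
```

Running the same two statements on `df_comments` would show whether the proportions and lengths support each bullet in the findings.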
