# 🎣 Analysis 1: Ragebait Detector

**Core question:** Which subreddits manufacture outrage? Which posts get people commenting furiously but not upvoting?

**Definition:**  
`controversy_ratio = num_comments / score`  
- High ratio = people argue but don't upvote = ragebait  
- Low ratio = people upvote without engaging = good content  
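
As a quick sanity check on the definition, here is a minimal sketch of the ratio in plain Python. The helper name `controversy_ratio` and the guard for non-positive scores are assumptions for illustration; the notebook reads a precomputed `controversy_ratio` column and does not show how it was built. The three (comments, score) pairs are taken from the top-posts output later in this notebook.

```python
from typing import Optional


def controversy_ratio(num_comments: int, score: int) -> Optional[float]:
    """Comments per point of score; None when the score is not positive.

    The None guard is an assumption: the notebook filters to score >= 5,
    so it never divides by zero in practice.
    """
    if score <= 0:
        return None
    return num_comments / score


# (num_comments, score) pairs copied from the top-posts table in this notebook.
examples = [(11558, 42), (1986, 8), (1137, 6)]
for num_comments, score in examples:
    ratio = controversy_ratio(num_comments, score)
    print(f"comments={num_comments:>6} score={score:>3} ratio={ratio:.2f}")
```

The rounded results (275.19, 248.25, 189.50) match the `controversy_ratio` values shown for those posts.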

**Interview talking point:**  
> "I operationalised ragebait as posts where comment count significantly outpaces score: the community is reacting, not rewarding. By average ratio, r/changemyview and r/aitah rank highest on this metric."


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName('Ragebait')
    .master('local[2]')
    .config('spark.driver.memory', '3g')
    .config('spark.sql.shuffle.partitions', '8')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('WARN')

df = spark.read.parquet('/mnt/c/Users/gusmc/OneDrive/Desktop/reddit_historical_data/data/silver/posts')

# Only look at posts with some traction to avoid noise
df = df.filter((F.col('score') >= 5) & (F.col('num_comments') >= 2))
print(f'Filtered posts: {df.count():,}')

Filtered posts: 2,671,772

In [3]:
# ── 1. Subreddit-level ragebait score ─────────────────────────────────────────
sub_rage = (
    df.groupBy('subreddit')
    .agg(
        F.count('*').alias('total_posts'),
        F.round(F.avg('controversy_ratio'), 3).alias('avg_controversy_ratio'),
        F.round(F.percentile_approx('controversy_ratio', 0.75), 3).alias('p75_controversy'),
        F.round(F.percentile_approx('controversy_ratio', 0.95), 3).alias('p95_controversy'),
        F.round(F.avg('score'), 1).alias('avg_score'),
        F.round(F.avg('num_comments'), 1).alias('avg_comments'),
        F.round(F.avg('upvote_ratio'), 3).alias('avg_upvote_ratio'),
        F.round(F.avg('title_sentiment'), 4).alias('avg_sentiment'),
    )
    .withColumn('ragebait_tier',
        F.when(F.col('avg_controversy_ratio') > 2.0, 'HIGH_RAGEBAIT')
         .when(F.col('avg_controversy_ratio') > 1.0, 'MEDIUM')
         .otherwise('LOW')
    )
    .orderBy(F.desc('avg_controversy_ratio'))
)

print('=== SUBREDDIT RAGEBAIT RANKING ===')
sub_rage.show(20, truncate=False)

=== SUBREDDIT RAGEBAIT RANKING ===

+--------------------+-----------+---------------------+---------------+---------------+---------+------------+----------------+-------------+-------------+
|subreddit           |total_posts|avg_controversy_ratio|p75_controversy|p95_controversy|avg_score|avg_comments|avg_upvote_ratio|avg_sentiment|ragebait_tier|
+--------------------+-----------+---------------------+---------------+---------------+---------+------------+----------------+-------------+-------------+
|changemyview        |65877      |3.322                |4.0            |10.857         |231.5    |143.5       |0.908           |-0.0607      |HIGH_RAGEBAIT|
|aitah               |46400      |1.867                |2.333          |5.889          |568.1    |167.9       |0.865           |-0.0784      |MEDIUM       |
|wallstreetbets      |167897     |1.385                |1.25           |3.4            |469.6    |159.7       |0.956           |0.0328       |MEDIUM       |
|dating_advice       |8001       |1.33                 |1.
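
The tier thresholds from the `F.when` chain above can be mirrored in plain Python to make the boundary behaviour explicit. This is only a sketch: the function name `ragebait_tier` is hypothetical, and the sample averages are the three complete rows visible in the ranking output.

```python
def ragebait_tier(avg_controversy_ratio: float) -> str:
    """Mirror the notebook's tiering: strictly above 2.0 is HIGH_RAGEBAIT,
    strictly above 1.0 is MEDIUM, everything else is LOW."""
    if avg_controversy_ratio > 2.0:
        return "HIGH_RAGEBAIT"
    if avg_controversy_ratio > 1.0:
        return "MEDIUM"
    return "LOW"


# Averages copied from the ranking table in this notebook.
for name, ratio in [("changemyview", 3.322), ("aitah", 1.867), ("wallstreetbets", 1.385)]:
    print(f"{name:<15} {ratio:>6} {ragebait_tier(ratio)}")
```

Because both comparisons are strict, a subreddit averaging exactly 2.0 lands in `MEDIUM` and one averaging exactly 1.0 lands in `LOW`.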

In [4]:
# ── 2. Top ragebait POSTS (the actual posts causing most arguments) ────────────
print('=== TOP 30 INDIVIDUAL RAGEBAIT POSTS ===')
(
    df
    .filter(F.col('controversy_ratio') > 3.0)
    .select(
        'subreddit',
        F.col('title').substr(1, 80).alias('title'),
        'score', 'num_comments',
        F.round('controversy_ratio', 2).alias('controversy_ratio'),
        'sentiment_label', 'year_month'
    )
    .orderBy(F.desc('controversy_ratio'))
    .limit(30)
).show(truncate=False)

=== TOP 30 INDIVIDUAL RAGEBAIT POSTS ===

+----------------+--------------------------------------------------------------------------------+-----+------------+-----------------+---------------+----------+
|subreddit       |title                                                                           |score|num_comments|controversy_ratio|sentiment_label|year_month|
+----------------+--------------------------------------------------------------------------------+-----+------------+-----------------+---------------+----------+
|wallstreetbets  |What Are Your Moves Tomorrow, October 01, 2020                                  |42   |11558       |275.19           |neutral        |2020-09   |
|unpopularopinion|LGBTQ+ Mega Thread                                                              |8    |1986        |248.25           |neutral        |2021-06   |
|politics        |[Meta] 19 more domains unfiltered, notes on voting, censorship in /r/politics an|6    |1137        |189.5            |positive       |2013-11   |
|politics       

                                                                                

In [5]:
# ── 3. Do NEGATIVE-sentiment posts get more controversy? ──────────────────────
print('=== CONTROVERSY RATIO BY SENTIMENT LABEL ===')
(
    df.groupBy('subreddit', 'sentiment_label')
    .agg(
        F.count('*').alias('post_count'),
        F.round(F.avg('controversy_ratio'), 3).alias('avg_controversy'),
        F.round(F.avg('score'), 1).alias('avg_score'),
    )
    .orderBy('subreddit', 'sentiment_label')
).show(40, truncate=False)

=== CONTROVERSY RATIO BY SENTIMENT LABEL ===

+--------------------+---------------+----------+---------------+---------+
|subreddit           |sentiment_label|post_count|avg_controversy|avg_score|
+--------------------+---------------+----------+---------------+---------+
|aitah               |negative       |19236     |1.8            |625.1    |
|aitah               |neutral        |16545     |1.96           |514.1    |
|aitah               |positive       |10619     |1.846          |549.1    |
|antiwork            |negative       |97319     |0.507          |962.9    |
|antiwork            |neutral        |106408    |0.543          |1050.4   |
|antiwork            |positive       |74792     |0.483          |1051.7   |
|changemyview        |negative       |28084     |3.325          |245.3    |
|changemyview        |neutral        |15862     |3.379          |221.7    |
|changemyview        |positive       |21931     |3.277          |221.1    |
|collapse            |negative       |24198     |0.471          |326.0    |
|collapse   

                                                                                

In [6]:
# ── 4. Ragebait over time: is it getting worse? ───────────────────────────────
print('=== MONTHLY RAGEBAIT TREND (politics + worldnews + AITAH) ===')
(
    df
    .filter(F.col('subreddit').isin('politics', 'worldnews', 'aitah'))
    .groupBy('subreddit', 'year_month')
    .agg(
        F.count('*').alias('posts'),
        F.round(F.avg('controversy_ratio'), 3).alias('avg_controversy'),
        F.round(F.avg('upvote_ratio'), 3).alias('avg_upvote_ratio'),
    )
    .orderBy('subreddit', 'year_month')
).show(60, truncate=False)

=== MONTHLY RAGEBAIT TREND (politics + worldnews + AITAH) ===

+---------+----------+-----+---------------+----------------+
|subreddit|year_month|posts|avg_controversy|avg_upvote_ratio|
+---------+----------+-----+---------------+----------------+
|aitah    |2025-03   |4622 |1.924          |0.868           |
|aitah    |2025-04   |4643 |1.763          |0.874           |
|aitah    |2025-05   |5222 |1.82           |0.873           |
|aitah    |2025-06   |5383 |1.801          |0.867           |
|aitah    |2025-07   |5063 |1.915          |0.864           |
|aitah    |2025-08   |4669 |1.874          |0.86            |
|aitah    |2025-09   |4318 |1.848          |0.861           |
|aitah    |2025-10   |4031 |1.837          |0.86            |
|aitah    |2025-11   |4130 |1.987          |0.86            |
|aitah    |2025-12   |4319 |1.931          |0.863           |
|politics |2007-08   |1378 |0.478          |1.0             |
|politics |2007-09   |2044 |0.539          |1.0             |
|politics |2007-10   |3232 |0.555          |1.0             |
|politic

                                                                                

In [7]:
spark.stop()
print('Ragebait analysis complete ✓')

Ragebait analysis complete ✓
