# 🔄 Analysis 2: Echo Chamber Score

**Core question:** Does a subreddit only reward posts that agree with its dominant sentiment?

**Method:**  
1. Find each subreddit's dominant sentiment (positive/negative/neutral)  
2. Compute `corr(title_sentiment, upvote_ratio)` per subreddit  
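3. Strong correlation in the direction of the dominant sentiment (positive for a positive community, negative for a negative one) = posts matching community sentiment get upvoted = echo chamber  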
4. Near-zero correlation = community upvotes based on other factors = healthier discourse  
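
To make steps 2 to 4 concrete, here is a minimal plain-Python sketch of the Pearson correlation on toy data. The `pearson` helper and both sample communities are hypothetical illustrations, not values from the dataset; the notebook itself relies on Spark's built-in `F.corr`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: title sentiment in [-1, 1], upvote ratio in (0, 1].
sentiment = [-0.8, -0.5, -0.1, 0.2, 0.6]
echo_ratio = [0.97, 0.93, 0.88, 0.82, 0.75]  # ratio falls as titles get more positive
flat_ratio = [0.90, 0.84, 0.91, 0.85, 0.90]  # ratio moves independently of sentiment

r_echo = pearson(sentiment, echo_ratio)
r_flat = pearson(sentiment, flat_ratio)
print(f'negative echo chamber: r = {r_echo:+.3f}')
print(f'indifferent community: r = {r_flat:+.3f}')
```

In the first toy community the most negative titles get the highest upvote ratio, so the correlation is strongly negative; in the second, sentiment barely predicts the ratio and the correlation stays close to zero.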

**Interview talking point:**  
> "I quantified echo chamber behaviour by correlating post sentiment with upvote ratio within each subreddit. Every subreddit landed in the weakest tier: the largest correlations, in r/worldnews (-0.04), r/conservative (-0.03) and r/collapse (-0.03), are too small to show that these communities reward posts for matching their dominant emotional tone."


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = (
    SparkSession.builder.appName('EchoChamber')
    .master('local[2]')
    .config('spark.driver.memory', '3g')
    .config('spark.sql.shuffle.partitions', '8')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('WARN')

df = spark.read.parquet('/mnt/c/Users/gusmc/OneDrive/Desktop/reddit_historical_data/data/silver/posts')
df = df.filter(
    (F.col('score') >= 5) &
    (F.col('upvote_ratio') > 0) &
    F.col('title_sentiment').isNotNull()
)
print(f'Posts loaded: {df.count():,}')



Posts loaded: 3,449,973

In [3]:
# ── 1. Dominant sentiment per subreddit ───────────────────────────────────────
dominant_sentiment = (
    df.groupBy('subreddit', 'sentiment_label')
    .count()
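    .withColumn('rank',
        # row_number() with a label tie-breaker guarantees exactly one row per
        # subreddit; rank() would return every tied label and duplicate posts
        # in the later join
        F.row_number().over(
            Window.partitionBy('subreddit').orderBy(F.desc('count'), F.col('sentiment_label'))
        )
    )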
    .filter(F.col('rank') == 1)
    .select(
        'subreddit',
        F.col('sentiment_label').alias('dominant_sentiment'),
        F.col('count').alias('dominant_count')
    )
)

print('=== DOMINANT SENTIMENT PER SUBREDDIT ===')
dominant_sentiment.show(20, truncate=False)

=== DOMINANT SENTIMENT PER SUBREDDIT ===

+--------------------+------------------+--------------+
|subreddit           |dominant_sentiment|dominant_count|
+--------------------+------------------+--------------+
|aitah               |negative          |19383         |
|antiwork            |neutral           |121587        |
|changemyview        |negative          |28120         |
|collapse            |negative          |24840         |
|conservative        |negative          |142609        |
|dating_advice       |positive          |3002          |
|femaledatingstrategy|positive          |14395         |
|formuladank         |neutral           |140503        |
|politics            |negative          |495932        |
|soccercirclejerk    |neutral           |76428         |
|trueoffmychest      |negative          |73845         |
|unpopularopinion    |negative          |94365         |
|wallstreetbets      |neutral           |97553         |
|worldnews           |negative          |246857        |
+--------------------+------------------+--------------+
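
Picking the top label by count needs a rule for ties: a ranking function that allows ties, such as `rank()`, returns every tied label, whereas `row_number()` with a tie-breaker returns exactly one. The sketch below shows the same idea in plain Python; `dominant_label` and the sample labels are hypothetical, and the alphabetical tie-break is an arbitrary assumption.

```python
from collections import Counter

def dominant_label(labels):
    """Return (label, count) for the most frequent label; ties broken alphabetically."""
    counts = Counter(labels)
    # Order by descending count, then by label name, so the result is deterministic.
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical subreddit with a perfect tie between two labels.
tied = ['negative', 'neutral', 'negative', 'neutral', 'positive']
top_label, top_count = dominant_label(tied)
print(top_label, top_count)
```

Whatever tie-break is chosen, the point is that each subreddit ends up with a single dominant label, so joining it back onto the posts cannot multiply rows.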

In [4]:
# ── 2. Echo chamber score = corr(sentiment, upvote_ratio) ─────────────────────
# Pearson correlation is built into Spark, so no UDF is needed
# Range: -1 to +1
#   +1 = positive posts always get upvoted (positive echo chamber)
#   -1 = negative posts always get upvoted (negativity echo chamber)
#    0 = sentiment doesn't predict upvotes at all

echo_scores = (
    df.groupBy('subreddit')
    .agg(
        F.count('*').alias('post_count'),
        F.round(F.corr('title_sentiment', 'upvote_ratio'), 4).alias('sentiment_upvote_corr'),
        F.round(F.corr('title_sentiment', 'score'), 4).alias('sentiment_score_corr'),
        F.round(F.avg('title_sentiment'), 4).alias('avg_sentiment'),
        F.round(F.avg('upvote_ratio'), 4).alias('avg_upvote_ratio'),
        F.round(F.stddev('title_sentiment'), 4).alias('sentiment_stddev'),
    )
    .withColumn('echo_chamber_score',
        # abs() keeps only the strength of the relationship; compare the sign of
        # sentiment_upvote_corr with the dominant sentiment to check that the
        # rewarded tone is the community's own
        F.round(F.abs(F.col('sentiment_upvote_corr')), 4)
    )
    .withColumn('echo_tier',
        F.when(F.col('echo_chamber_score') > 0.3, 'STRONG_ECHO')
         .when(F.col('echo_chamber_score') > 0.15, 'MODERATE_ECHO')
         .otherwise('WEAK_ECHO')
    )
    .orderBy(F.desc('echo_chamber_score'))
)

print('=== ECHO CHAMBER SCORES (higher = stronger echo chamber) ===')
echo_scores.show(20, truncate=False)

=== ECHO CHAMBER SCORES (higher = stronger echo chamber) ===

+--------------------+----------+---------------------+--------------------+-------------+----------------+----------------+------------------+---------+
|subreddit           |post_count|sentiment_upvote_corr|sentiment_score_corr|avg_sentiment|avg_upvote_ratio|sentiment_stddev|echo_chamber_score|echo_tier|
+--------------------+----------+---------------------+--------------------+-------------+----------------+----------------+------------------+---------+
|worldnews           |471974    |-0.0404              |0.0029              |-0.1984      |0.9611          |0.4165          |0.0404            |WEAK_ECHO|
|conservative        |318287    |-0.0334              |0.0083              |-0.1183      |0.7807          |0.4071          |0.0334            |WEAK_ECHO|
|collapse            |52556     |-0.0273              |0.0                 |-0.1562      |0.9275          |0.381           |0.0273            |WEAK_ECHO|
|soccercirclejerk    |165029    |0.0253               |-0.0131             |
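
To put the size of these correlations in context, the sketch below converts the three complete rows above into variance explained (r squared) and the standard t statistic for a Pearson correlation. The helper names are hypothetical, and the t statistic assumes independent posts, which is only an approximation for Reddit data.

```python
from math import sqrt

# (subreddit, post_count, sentiment_upvote_corr) copied from the output above.
observed = [
    ('worldnews', 471974, -0.0404),
    ('conservative', 318287, -0.0334),
    ('collapse', 52556, -0.0273),
]

def variance_explained(r):
    """Share of upvote_ratio variance linearly explained by sentiment."""
    return r * r

def t_statistic(r, n):
    """t statistic for H0: true correlation is zero (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r * r))

summary = {
    name: (variance_explained(r), t_statistic(r, n))
    for name, n, r in observed
}
for name, (r2, t) in summary.items():
    print(f'{name:<14} r^2 = {r2:.4%}  t = {t:+.1f}')
```

With hundreds of thousands of posts even a tiny correlation is statistically distinguishable from zero, yet each one explains well under 1% of the variance in upvote ratio, which is why every row falls in the `WEAK_ECHO` tier.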

                                                                                

In [5]:
# ── 3. Sentiment distribution per subreddit ───────────────────────────────────
# The shape of sentiment distribution tells you a lot:
# - Narrow distribution = homogeneous community (echo chamber)
# - Wide distribution = diverse viewpoints (healthier)

print('=== SENTIMENT DISTRIBUTION BY SUBREDDIT ===')
(
    df.groupBy('subreddit', 'sentiment_label')
    .count()
    .groupBy('subreddit')
    .pivot('sentiment_label', ['positive', 'neutral', 'negative'])
    .sum('count')
    .fillna(0)
    .withColumn('total', F.col('positive') + F.col('neutral') + F.col('negative'))
    .withColumn('positive_pct', F.round(F.col('positive') / F.col('total') * 100, 1))
    .withColumn('negative_pct', F.round(F.col('negative') / F.col('total') * 100, 1))
    .withColumn('neutral_pct',  F.round(F.col('neutral')  / F.col('total') * 100, 1))
    .select('subreddit', 'total', 'positive_pct', 'neutral_pct', 'negative_pct')
    .orderBy(F.desc('negative_pct'))
).show(20, truncate=False)

=== SENTIMENT DISTRIBUTION BY SUBREDDIT ===

+--------------------+-------+------------+-----------+------------+
|subreddit           |total  |positive_pct|neutral_pct|negative_pct|
+--------------------+-------+------------+-----------+------------+
|worldnews           |471974 |19.5        |28.2       |52.3        |
|trueoffmychest      |145483 |26.1        |23.1       |50.8        |
|collapse            |52556  |19.4        |33.3       |47.3        |
|conservative        |318287 |24.3        |30.9       |44.8        |
|changemyview        |66113  |33.4        |24.1       |42.5        |
|politics            |1167834|28.9        |28.6       |42.5        |
|unpopularopinion    |223777 |37.4        |20.5       |42.2        |
|aitah               |46776  |22.9        |35.7       |41.4        |
|antiwork            |317854 |27.2        |38.3       |34.5        |
|femaledatingstrategy|39924  |36.1        |33.5       |30.4        |
|dating_advice       |8119   |37.0        |34.2       |28.8        |
|soccercirclejerk    |165029 |31.9
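
One way to turn "narrow vs wide distribution" into a single number is normalized Shannon entropy over the three labels: 0 means every post carries the same label, 1 means a perfectly even split. This metric is an assumption added for illustration, not part of the original analysis; the percentages are copied from complete rows of the table above.

```python
from math import log

def normalized_entropy(percentages):
    """Shannon entropy of a label distribution, scaled to [0, 1]."""
    total = sum(percentages)
    probs = [p / total for p in percentages if p > 0]
    entropy = -sum(p * log(p) for p in probs)
    return entropy / log(len(percentages))

# (positive_pct, neutral_pct, negative_pct) copied from the table above.
distribution = {
    'worldnews': (19.5, 28.2, 52.3),
    'politics': (28.9, 28.6, 42.5),
    'dating_advice': (37.0, 34.2, 28.8),
}
spread = {name: normalized_entropy(pcts) for name, pcts in distribution.items()}
for name, value in sorted(spread.items(), key=lambda kv: kv[1]):
    print(f'{name:<14} {value:.3f}')
```

Of these three, r/worldnews has the most concentrated label mix and r/dating_advice the most even one, though all three remain fairly close to an even split.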

                                                                                

In [6]:
# ── 4. How does upvote ratio differ for aligned vs misaligned posts? ──────────
# For each sub: compare upvote_ratio when sentiment matches dominant
# vs when it goes against the grain

enriched = df.join(F.broadcast(dominant_sentiment), on='subreddit', how='left')

aligned_analysis = (
    enriched
    .withColumn('sentiment_aligned',
        F.col('sentiment_label') == F.col('dominant_sentiment')
    )
    .groupBy('subreddit', 'sentiment_aligned')
    .agg(
        F.count('*').alias('post_count'),
        F.round(F.avg('upvote_ratio'), 4).alias('avg_upvote_ratio'),
        F.round(F.avg('score'), 1).alias('avg_score'),
    )
    .orderBy('subreddit', 'sentiment_aligned')
)

print('=== ALIGNED VS MISALIGNED POST PERFORMANCE ===')
print("(sentiment_aligned=True means the post matches the subreddit's dominant tone)")
aligned_analysis.show(40, truncate=False)

=== ALIGNED VS MISALIGNED POST PERFORMANCE ===
(sentiment_aligned=True means the post matches the subreddit's dominant tone)

+--------------------+-----------------+----------+----------------+---------+
|subreddit           |sentiment_aligned|post_count|avg_upvote_ratio|avg_score|
+--------------------+-----------------+----------+----------------+---------+
|aitah               |false            |27393     |0.8657          |523.5    |
|aitah               |true             |19383     |0.8656          |620.4    |
|antiwork            |false            |196267    |0.9179          |881.0    |
|antiwork            |true             |121587    |0.9176          |922.0    |
|changemyview        |false            |37993     |0.9092          |220.2    |
|changemyview        |true             |28120     |0.9058          |245.1    |
|collapse            |false            |27716     |0.9253          |312.7    |
|collapse            |true             |24840     |0.9299          |318.2    |
|conservative        |false            |175678    |0.7767          |250.5    |
|conservative        |true             |142609    |0
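
The aligned and misaligned rows are easier to compare as a per-subreddit gap. The sketch below does that arithmetic for the four subreddits whose rows are complete in the output above; the tuple layout and the `gaps` name are hypothetical.

```python
# (subreddit, aligned upvote ratio, misaligned upvote ratio,
#  aligned avg score, misaligned avg score) copied from the table above.
rows = [
    ('aitah',        0.8656, 0.8657, 620.4, 523.5),
    ('antiwork',     0.9176, 0.9179, 922.0, 881.0),
    ('changemyview', 0.9058, 0.9092, 245.1, 220.2),
    ('collapse',     0.9299, 0.9253, 318.2, 312.7),
]

# Per subreddit: (upvote-ratio gap, relative lift in average score).
gaps = {
    name: (round(ratio_al - ratio_mis, 4), round(score_al / score_mis - 1, 3))
    for name, ratio_al, ratio_mis, score_al, score_mis in rows
}
for name, (ratio_gap, score_lift) in gaps.items():
    print(f'{name:<13} ratio gap = {ratio_gap:+.4f}  score lift = {score_lift:+.1%}')
```

In all four subreddits the upvote ratios differ by less than 0.005, which matches the weak correlations from step 2. Average score is higher for aligned posts in each case, but mean score is sensitive to a few viral posts, so that gap deserves a more robust check (for example medians) before reading much into it.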

                                                                                

In [7]:
spark.stop()
print('Echo chamber analysis complete ✓')

Echo chamber analysis complete ✓
