# Frequency Analysis of Subreddits

**Author:** Noah Wassberg

**Date Created:** March 7, 2024

## Description

This notebook measures the average number of profane words per post in each subreddit. It uses the Reddit post dataset (`corpus-webis-tldr-17.json`) and the Surge AI profanity list. The goal is to find which subreddits have, on average, the most and the least explicit language in their text content.

## Output

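The output is two lists of subreddits ranked by the average number of explicit words per post: the 20 least explicit and the 20 most explicit. For each subreddit, the score is the number of explicit words divided by the number of posts from that subreddit. A word counts as explicit when the lowercased, whitespace-delimited token exactly matches an entry in the Surge AI profanity list, so multi-word entries in the list cannot match. Subreddits with fewer than 200 posts are excluded.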
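
As a rough illustration of the score defined above, here is a minimal pure-Python sketch on toy data. The posts, the word set, and the function name `average_explicit_words` are hypothetical and only stand in for the real datasets and the Spark pipeline below.

```python
from collections import defaultdict

# Hypothetical toy data: (subreddit, post_id, text). Not from the real dataset.
posts = [
    ("askfoo", "p1", "this darn thing is darn slow"),
    ("askfoo", "p2", "works fine for me"),
    ("barbaz", "p3", "heck no"),
    ("barbaz", "p4", "nice day"),
]
# Hypothetical stand-in for the profanity list.
profanity = {"darn", "heck"}


def average_explicit_words(posts, profanity):
    """Return {subreddit: explicit word count / number of distinct posts}."""
    explicit = defaultdict(int)
    post_ids = defaultdict(set)
    for subreddit, post_id, text in posts:
        post_ids[subreddit].add(post_id)
        # Lowercase and split on whitespace, then count exact matches.
        explicit[subreddit] += sum(1 for w in text.lower().split() if w in profanity)
    return {s: explicit[s] / len(ids) for s, ids in post_ids.items()}


averages = average_explicit_words(posts, profanity)
print(averages)
```

In this toy example `askfoo` has two matches over two posts and `barbaz` has one match over two posts.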

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, explode, split, col, lower, sum as spark_sum, first, broadcast, concat_ws

#Init Spark session
spark_session = SparkSession.builder\
        .master("spark://192.168.2.193:7077") \
        .appName("test")\
        .config("spark.dynamicAllocation.enabled", True)\
        .config("spark.dynamicAllocation.shuffleTracking.enabled",True)\
        .config("spark.shuffle.service.enabled", False)\
        .config("spark.dynamicAllocation.executorIdleTimeout","30s")\
        .config("spark.executor.cores", 4)\
        .config("spark.driver.port",9999)\
        .config("spark.blockManager.port",10005)\
        .getOrCreate()

#Get Spark context
spark_context = spark_session.sparkContext
#Set log level to error
spark_context.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/07 09:00:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
#Import profanity list
profanity_unfiltered = spark_session.read.csv("hdfs://192.168.2.193:9000/user/hadoop/input/input/profanity_en.csv", 
                                         header='true', inferSchema='true')

#Import reddit post dataset
posts_unfiltered = spark_session.read.json("hdfs://192.168.2.193:9000/user/hadoop/input/input/corpus-webis-tldr-17.json")

In [3]:
#Remove unnecessary fields from profanity list.
profanity = profanity_unfiltered.drop("canonical_form_1", "canonical_form_2", "canonical_form_3",
                          "category_1","category_2","category_3","severity_description", "severity_rating")

#Remove unnecessary fields from reddit post dataset.
posts_unfiltered = posts_unfiltered.drop("summary","summary_len","content_len","author", "body","normalizedBody")

In [4]:
#Get the count of unique posts for each subreddit in the dataset
#and keep only subreddits with at least 200 posts to avoid outliers
post_counts = posts_unfiltered.groupBy("subreddit_id").agg(countDistinct("id").alias("total_posts")) \
    .filter(col("total_posts") >= 200)

#Remove all posts from subreddits with fewer than 200 posts
#(the inner join keeps only posts whose subreddit_id appears in post_counts)
posts = posts_unfiltered.join(broadcast(post_counts), "subreddit_id")

In [5]:
#Preprocess the content of each reddit post and the profanity list

#Combine 'content' and 'title' into a single column for tokenization
posts_combined = posts.withColumn("combined", concat_ws(" ", col("content"), col("title")))

#Tokenize the combined content and title and convert to lowercase
posts_tokenized = posts_combined.withColumn("word", explode(split(lower(col("combined")), "\\s+")))

#Lowercase the 'text' in profanity DataFrame for case-insensitive matching
profanity = profanity.withColumn("text", lower(col("text")))
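
One consequence of splitting on whitespace is that punctuation stays attached to tokens, so a token such as `darn,` does not equal the list entry `darn`. The sketch below compares the two approaches in plain Python. The sample text, the one-word list, and the regex-based alternative are assumptions for illustration and are not what the notebook does.

```python
import re

text = "Well, darn! That is darn, darn slow."
profanity = {"darn"}  # hypothetical stand-in for the profanity list

# Whitespace tokenization, as in the cell above: punctuation stays attached.
whitespace_tokens = re.split(r"\s+", text.lower())
whitespace_hits = sum(1 for t in whitespace_tokens if t in profanity)

# Hypothetical alternative: keep only runs of letters, digits, and apostrophes.
word_tokens = re.findall(r"[a-z0-9']+", text.lower())
word_hits = sum(1 for t in word_tokens if t in profanity)

print(whitespace_hits, word_hits)
```

Here only the bare `darn` matches under whitespace splitting, while the regex variant matches all three occurrences. The averages reported below may therefore undercount words that sit next to punctuation.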

In [8]:
#Join the tokenized posts with the profanity DataFrame on the word matching 'text'
#Using a broadcast join since profanity is much smaller than posts_tokenized
joined_df = posts_tokenized.join(broadcast(profanity), posts_tokenized.word == profanity.text)



#Group by subreddit, count explicit words, and rename the count column.
explicit_count = joined_df.groupBy("subreddit_id","subreddit").count().withColumnRenamed("count", "explicit_word_count")

In [17]:
#Join with post counts, calculate average explicit words per post, and select relevant columns.
avg_explicit_count = post_counts.join(explicit_count, "subreddit_id") \
    .select("subreddit", (col("explicit_word_count") / col("total_posts")).alias("average_explicit_words"))
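
Because the join above is an inner join, a subreddit that produced no matched words has no row in `explicit_count` and so is absent from the ranking rather than listed with an average of 0. The pure-Python sketch below shows the difference between inner-join and left-join semantics. The dictionaries and their values are hypothetical.

```python
# Hypothetical per-subreddit tables, keyed by subreddit name.
total_posts = {"askfoo": 200, "barbaz": 400, "quux": 250}
explicit_word_count = {"askfoo": 50, "barbaz": 100}  # "quux" had no matches

# Inner-join semantics: only keys present in both tables survive.
inner = {
    s: explicit_word_count[s] / n
    for s, n in total_posts.items()
    if s in explicit_word_count
}

# Left-join semantics, defaulting a missing count to 0.
left = {s: explicit_word_count.get(s, 0) / n for s, n in total_posts.items()}

print(inner)
print(left)
```

If zero-match subreddits should appear in the "least explicit" list, a left join from `post_counts` with missing counts filled as 0 would be one way to keep them. That would also require carrying the `subreddit` name through `post_counts`.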

In [18]:
#Cache the DataFrame to avoid recomputations
avg_explicit_count.cache()

#Show the 20 least explicit subreddits
print("Top 20 least severe subreddits")
avg_explicit_count.orderBy(col("average_explicit_words").asc()).show()
#Show the 20 most explicit subreddits
print("Top 20 most severe subreddits")
avg_explicit_count.orderBy(col("average_explicit_words").desc()).show()

#clear the cache
avg_explicit_count.unpersist()

Top 20 least severe subreddits

+---------------+----------------------+
|      subreddit|average_explicit_words|
+---------------+----------------------+
|   HomeworkHelp|  0.016826923076923076|
|    learnpython|  0.017057569296375266|
|    rocketbeans|   0.02240325865580448|
|   TheSilphRoad|   0.02710843373493976|
|        grammar|  0.029556650246305417|
|latterdaysaints|  0.029595015576323987|
|     dogemining|  0.030837004405286344|
|       churning|   0.03355704697986577|
|          italy|   0.03477218225419664|
|       Toontown|  0.037815126050420166|
|       chromeos|  0.040697674418604654|
|        Unity3D|  0.044585987261146494|
|           Hair|  0.044897959183673466|
|         spacex|  0.045454545454545456|
|      Wordpress|   0.04556962025316456|
|     mtgfinance|  0.045871559633027525|
|    freemasonry|   0.04669260700389105|
|        pkmntcg|              0.046875|
|       portugal|  0.048223350253807105|
|   empirepowers|   0.04924242424242424|
+---------------+----------------------+
only showing top

DataFrame[subreddit: string, average_explicit_words: double]