# Spread Profanity Analysis

**Author:** Simon Pislar

**Date Created:** March 14, 2024

**Description:** This notebook measures how widely individual Reddit users spread profanity across subreddits. It joins a list of English profanity terms against the Webis-TLDR-17 Reddit corpus and ranks authors by the number of distinct subreddits in which they used at least one of those terms.

**Output:** A table of the 20 authors who used profanity in the most distinct subreddits, with the subreddit count for each.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, explode, split, countDistinct, broadcast

# Initialize the Spark session against the cluster master
spark_session = (
    SparkSession.builder
    .master("spark://group3-master:7077")
    .appName("Spread_Profanity_Analysis")
    .getOrCreate()
)

# Get the Spark context
spark_context = spark_session.sparkContext
# Suppress log messages below the ERROR level
spark_context.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/14 03:12:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# Load the profanity dataset
profanity_df = spark_session.read.csv("file:///home/ubuntu/profanity/profanity_en.csv", 
                                   header=True, inferSchema=True).select("text", "severity_rating")

# Load the Reddit dataset
reddit_df = spark_session.read.json("file:///home/ubuntu/volume/reddit/corpus-webis-tldr-17.json")



In [3]:
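# Tokenize the 'body' field: lowercase it, split on runs of non-word characters,
# and explode the result so that each token gets its own row
reddit_tokenized = reddit_df.withColumn("words", explode(split(lower(col("body")), "\\W+")))

# Keep only tokens that exactly match a profanity term. The profanity list is
# broadcast so the join does not shuffle the Reddit data. Tokens contain no
# spaces, so any multi-word entries in the list never match.
reddit_profanity = reddit_tokenized.join(broadcast(profanity_df), col("words") == col("text"), "inner")

# Count the distinct subreddits in which each author used at least one profanity term
author_subreddit_count = reddit_profanity.groupBy("author") \
                                         .agg(countDistinct("subreddit").alias("subreddit_count"))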
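
To make the logic of this cell easier to check by hand, here is a minimal plain-Python sketch of the same three steps (tokenize, match against the lexicon, count distinct subreddits per author). It is not the notebook's Spark code: the `posts` and `lexicon` values are hypothetical toy data, and `tokenize` and `subreddit_spread` are illustrative helper names. The `re.ASCII` flag is an assumption made to approximate the default behavior of the Java regex `\W` that Spark's `split` uses.

```python
import re
from collections import defaultdict

# Hypothetical toy data standing in for the Reddit corpus and the profanity list.
posts = [
    {"author": "alice", "subreddit": "news", "body": "Well, DAMN. That is bad."},
    {"author": "alice", "subreddit": "funny", "body": "damn again, and hell too"},
    {"author": "alice", "subreddit": "news", "body": "Nothing rude here."},
    {"author": "bob", "subreddit": "news", "body": "What the hell?"},
    {"author": "carol", "subreddit": "aww", "body": "Such a cute dog!"},
]
lexicon = {"damn", "hell"}


def tokenize(body):
    # Approximates split(lower(col("body")), "\\W+"): lowercase the text,
    # then split on runs of non-word characters (ASCII semantics assumed).
    return re.split(r"\W+", body.lower(), flags=re.ASCII)


def subreddit_spread(posts, lexicon):
    # For each author, collect the set of subreddits in which at least one
    # token matched the lexicon, then report the size of that set.
    seen = defaultdict(set)
    for post in posts:
        if any(token in lexicon for token in tokenize(post["body"])):
            seen[post["author"]].add(post["subreddit"])
    return {author: len(subreddits) for author, subreddits in seen.items()}


spread = subreddit_spread(posts, lexicon)
print(spread)
```

Authors with no matching token, such as `carol` here, do not appear in the result at all, which mirrors the effect of the inner join above. Splitting can also produce empty-string tokens at the start or end of a body; they are harmless as long as the lexicon has no empty entry.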


In [4]:
# Get the top 20 users with profanity across the most subreddits
top_users = author_subreddit_count.orderBy(col("subreddit_count").desc()).limit(20)

top_users.show()


+----------------+---------------+
|          author|subreddit_count|
+----------------+---------------+
|       [deleted]|           4525|
|         DejaBoo|            103|
|      FrankManic|             61|
|     Death_Star_|             56|
|     herman_gill|             56|
|      Shaper_pmp|             45|
|Rancid_Bear_Meat|             44|
|      Batty-Koda|             38|
|       kleinbl00|             37|
|          mauxly|             36|
|    backnblack92|             36|
|      CocoSavege|             35|
|    HittingSmoke|             35|
|      well_golly|             35|
|  ldonthaveaname|             35|
|            KoNP|             34|
|      Stingray88|             33|
|   use_more_lube|             32|
|       AngryData|             32|
|      ATomatoAmI|             31|
+----------------+---------------+
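
The first row, `[deleted]`, is the placeholder Reddit shows for removed accounts, so it pools many different people rather than describing one user. One possible refinement, sketched below in plain Python, is to drop such placeholders before ranking. The counts are the first five rows of the table above; `PLACEHOLDER_AUTHORS` and `top_authors` are hypothetical names, and treating `[deleted]` as the only placeholder is an assumption.

```python
# The first five (author, subreddit_count) rows from the table above.
counts = {
    "[deleted]": 4525,
    "DejaBoo": 103,
    "FrankManic": 61,
    "Death_Star_": 56,
    "herman_gill": 56,
}

# Assumption: "[deleted]" is the only placeholder author in the data.
PLACEHOLDER_AUTHORS = frozenset({"[deleted]"})


def top_authors(counts, n=20, exclude=PLACEHOLDER_AUTHORS):
    # Drop placeholder authors, then sort by count (descending) and by
    # author name (ascending) so that ties are ordered deterministically.
    ranked = sorted(
        ((author, count) for author, count in counts.items() if author not in exclude),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:n]


top = top_authors(counts, n=3)
print(top)
```

In the notebook itself, a comparable effect could likely be had by adding a filter such as `author_subreddit_count.filter(col("author") != "[deleted]")` before the `orderBy` call. That step is a suggestion, not part of the original analysis, and it would change the table shown above.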


In [None]:
# Clean up resources
spark_session.stop()