# Vulgarity Analysis of Reddit Posts by Author

**Author:** Agron Metaj

**Date Created:** March 4, 2024  

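**Description:** This notebook measures how often, and how severely, authors use vulgar language in a dataset of Reddit posts. Each post is tokenized and its tokens are matched against a predefined profanity list that carries a severity rating per entry. Authors are then ranked across all subreddits in the dataset by the number of distinct posts containing at least one match, with ties broken by the average severity of the matched words. The results highlight patterns in the use of vulgar language and may assist content moderation efforts.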

**Output:**  
- A list of authors ranked by the count of their vulgar posts and the average severity of vulgarity.
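
The two output metrics can be illustrated without a cluster. The following is a minimal pure-Python sketch, not the notebook's Spark code: the `matches` rows and the `rank_authors` helper are hypothetical, and each row stands for one token that matched the profanity list. It shows that a post with several matches counts once toward the post count, while every match contributes to the average severity.

```python
from collections import defaultdict

# Hypothetical (author, post_id, severity_rating) rows, one per matched token.
matches = [
    ("alice", "p1", 1.0),
    ("alice", "p1", 3.0),  # second vulgar token in the same post
    ("alice", "p2", 2.0),
    ("bob", "p3", 3.0),
]

def rank_authors(rows):
    """Return (author, vulgar_posts_count, average_severity) tuples,
    sorted by post count and then average severity, both descending."""
    post_ids = defaultdict(set)
    severities = defaultdict(list)
    for author, post_id, severity in rows:
        post_ids[author].add(post_id)          # distinct posts per author
        severities[author].append(severity)    # every match feeds the average
    summary = [
        (author, len(post_ids[author]),
         sum(severities[author]) / len(severities[author]))
        for author in post_ids
    ]
    return sorted(summary, key=lambda r: (-r[1], -r[2]))

ranking = rank_authors(matches)
for row in ranking:
    print(row)
```

Under these assumptions, `alice` ranks first with two distinct posts even though `bob` has the higher average severity, because post count is the primary sort key.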

In [1]:
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, broadcast, col, countDistinct, explode, lower, udf
from pyspark.sql.types import ArrayType, StringType

#Init Spark session
spark_session = SparkSession.builder\
        .master("spark://192.168.2.193:7077") \
        .appName("test")\
        .config("spark.dynamicAllocation.enabled", True)\
        .config("spark.dynamicAllocation.shuffleTracking.enabled",True)\
        .config("spark.shuffle.service.enabled", False)\
        .config("spark.dynamicAllocation.executorIdleTimeout","30s")\
        .config("spark.executor.cores", 2)\
        .config("spark.driver.port",9999)\
        .config("spark.blockManager.port",10005)\
        .getOrCreate()

#Get Spark context
spark_context = spark_session.sparkContext
#Set log level to error
spark_context.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/04 10:38:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
#Import profanity list
profanity_unfiltered = spark_session.read.csv(
    "hdfs://192.168.2.193:9000/user/hadoop/input/input/profanity_en.csv",
    header=True, inferSchema=True)

#Import reddit post dataset
posts_unfiltered = spark_session.read.json("hdfs://192.168.2.193:9000/user/hadoop/input/input/corpus-webis-tldr-17.json")


In [3]:
# Define a UDF to clean and tokenize post content
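def clean_and_tokenize(content):
    # Posts with missing content would otherwise raise on .lower()
    if content is None:
        return []
    # Lowercase, split on runs of non-word characters, and drop empty tokens
    tokens = re.split(r'\W+', content.lower())
    return [token for token in tokens if token]

# Wrap the function as a Spark UDF that returns an array of strings
clean_and_tokenize_udf = udf(clean_and_tokenize, ArrayType(StringType()))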

# Preprocess Reddit posts to tokenize the content
posts = posts_unfiltered.withColumn("tokens", clean_and_tokenize_udf(col("content")))
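
To see what this tokenization does to a post, the same regular expression can be tried on a plain string outside Spark. This is a small sketch under assumptions: the `tokenize` name and the sample sentence are made up for illustration, and returning an empty list for missing content is one possible way to handle nulls.

```python
import re

def tokenize(content):
    """Lowercase the text and split it on runs of non-word characters."""
    if content is None:  # assumption: treat missing content as no tokens
        return []
    return [t for t in re.split(r"\W+", content.lower()) if t]

sample = "Well, DAMN... that's a mess!"
tokens = tokenize(sample)
print(tokens)
```

Two properties are worth keeping in mind when reading the results: apostrophes count as separators, so contractions are split into pieces, and underscores count as word characters, so `snake_case` stays a single token.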


In [4]:
# Explode tokens for joining with the profanity list
posts_exploded = posts.select("author", "id", explode(col("tokens")).alias("token"))

# Filter profanity list to relevant columns and broadcast for efficient joining
profanity_filtered = broadcast(profanity_unfiltered.select("text", "severity_rating"))

# Join exploded posts with profanity list on matching tokens
joined = posts_exploded.join(profanity_filtered, posts_exploded.token == lower(profanity_filtered.text), "inner")
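
The explode-then-join step can be pictured with ordinary Python data structures. The sketch below is illustrative only: the `severity_by_word` lexicon, the sample posts, and the `explode_and_join` helper are hypothetical, and the dictionary assumes each word appears once in the profanity list. It shows that the join keeps only exact token matches and emits one row per matching occurrence.

```python
# Hypothetical lexicon: lowercased word -> severity rating.
severity_by_word = {"damn": 1.0, "hell": 1.2}

# Hypothetical tokenized posts.
posts = [
    {"author": "alice", "id": "p1", "tokens": ["damn", "that", "damn", "cat"]},
    {"author": "bob", "id": "p2", "tokens": ["hello", "world"]},
]

def explode_and_join(posts, lexicon):
    """Emit one (author, id, token, severity) row per token found in the lexicon."""
    rows = []
    for post in posts:
        for token in post["tokens"]:   # explode: one candidate row per token
            if token in lexicon:       # inner join: keep exact matches only
                rows.append((post["author"], post["id"], token, lexicon[token]))
    return rows

joined_rows = explode_and_join(posts, severity_by_word)
for row in joined_rows:
    print(row)
```

Here `hello` does not match `hell` because whole tokens are compared, and the repeated `damn` in post `p1` yields two rows. Any list entry that spans several words could never equal a single token, so it would not match under this scheme.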


In [5]:
# Aggregate vulgarity metrics by author
vulgarity_by_author = joined.groupBy("author")\
                            .agg(countDistinct("id").alias("vulgar_posts_count"), 
                                 avg("severity_rating").alias("average_severity"))

# Order by vulgar post count and severity for top contributors
top_vulgar_authors = vulgarity_by_author.orderBy(col("vulgar_posts_count").desc(), col("average_severity").desc())

# Display top vulgar authors
top_vulgar_authors.show()

# Cleanup
spark_session.stop()


+-----------------+------------------+------------------+
|           author|vulgar_posts_count|  average_severity|
+-----------------+------------------+------------------+
|        [deleted]|            122183|1.5462271483014198|
|     iamtotalcrap|               882|1.5453619909502254|
|          DejaBoo|               314|1.6144329896907217|
|          codayus|               231|1.4870848708487086|
|       Shaper_pmp|               216|1.4568527918781728|
|     josiahpapaya|               171|1.6237659963436928|
|       pixis-4950|               171|1.5519736842105263|
|   dinosaur_train|               161| 1.522788203753351|
|           mauxly|               160| 1.589734513274336|
|      Death_Star_|               150| 1.262992125984252|
|  RamsesThePigeon|               149| 1.554983922829582|
|      herman_gill|               143|1.5890109890109894|
|       FrankManic|               142| 1.598224852071006|
|       Stingray88|               141|1.7966244725738392|
|   iamadogfor