### Introduction

This project builds a regression model in PySpark to predict the viral potential of social media memes, measured as an Engagement Score derived from likes, shares, and comments. The model uses Image Complexity, Text Length, Posting Hour, Follower Count, Meme Category, and Platform as features. Categorical columns are one-hot encoded, Follower Count is standardized, and rows with missing values are dropped rather than imputed. The dataset, sourced from meme-sharing platforms such as Reddit, Instagram, TikTok, Facebook, and X, contains both numeric and categorical variables, which in principle lets a model capture cultural and contextual factors that influence meme virality. The goal is to give content creators and marketers a way to estimate meme performance before posting.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MemeBoom Predictor 1').getOrCreate()

In [2]:
meme_data = spark.read.csv("../Data/meme_dataset.csv", header=True, inferSchema=True)

In [3]:
meme_data.show()

+----------------+----------------+-----------+------------+--------------+-------------+---------+
|Engagement Score|Image Complexity|Text Length|Posting Hour|Follower Count|Meme Category| Platform|
+----------------+----------------+-----------+------------+--------------+-------------+---------+
|     59.93428306|           0.692|         23|          10|        176299|    wholesome|   Reddit|
|     47.23471398|           0.426|         19|          19|        545315|     reaction|   TikTok|
|     62.95377076|           0.384|         27|          21|         96373|         dark|Instagram|
|     80.46059713|           0.687|         30|          13|          2258|    relatable|Instagram|
|     45.31693251|          -0.046|         25|          22|        453061|       satire|        X|
|     45.31726086|           0.245|         19|          10|        849414|    relatable|   Reddit|
|     81.58425631|           0.549|         20|           0|        479100|    wholesome|        X|


In [4]:
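meme_data.printSchema()

root
 |-- Engagement Score: double (nullable = true)
 |-- Image Complexity: double (nullable = true)
 |-- Text Length: integer (nullable = true)
 |-- Posting Hour: integer (nullable = true)
 |-- Follower Count: integer (nullable = true)
 |-- Meme Category: string (nullable = true)
 |-- Platform: string (nullable = true)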

In [5]:
from pyspark.sql.functions import col, when, count
meme_data.select([count(when(col(c).isNull(),c)).alias(c) for c in meme_data.columns]).show()

+----------------+----------------+-----------+------------+--------------+-------------+--------+
|Engagement Score|Image Complexity|Text Length|Posting Hour|Follower Count|Meme Category|Platform|
+----------------+----------------+-----------+------------+--------------+-------------+--------+
|               4|               3|          6|           4|             1|            2|       2|
+----------------+----------------+-----------+------------+--------------+-------------+--------+



In [6]:
clean_data = meme_data.dropna()

In [7]:
clean_data.select([count(when(col(c).isNull(),c)).alias(c) for c in clean_data.columns]).show()

+----------------+----------------+-----------+------------+--------------+-------------+--------+
|Engagement Score|Image Complexity|Text Length|Posting Hour|Follower Count|Meme Category|Platform|
+----------------+----------------+-----------+------------+--------------+-------------+--------+
|               0|               0|          0|           0|             0|            0|       0|
+----------------+----------------+-----------+------------+--------------+-------------+--------+



In [25]:
clean_data.describe().show()

+-------+------------------+-------------------+-----------------+------------------+------------------+-------------+--------+
|summary|  Engagement Score|   Image Complexity|      Text Length|      Posting Hour|    Follower Count|Meme Category|Platform|
+-------+------------------+-------------------+-----------------+------------------+------------------+-------------+--------+
|  count|             89978|              89978|            89978|             89978|             89978|        89978|   89978|
|   mean| 50.03592064727755|0.38724824957264936|19.98128431394341|11.498944186356665|499883.14123452397|         NULL|    NULL|
| stddev|19.802773907281697|0.17993630535247407| 4.46479435102178| 6.914074133713424| 288903.1315864442|         NULL|    NULL|
|    min|               0.0|               -0.3|                2|                 0|               103|       absurd|Facebook|
|    max|             100.0|               1.34|               42|                23|            999993|

In [11]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
# StringIndexer for Meme Category
category_indexer = StringIndexer(inputCol="Meme Category", outputCol="MemeCategoryIndex")
category_encoder = OneHotEncoder(inputCols=["MemeCategoryIndex"], outputCols=["MemeCategoryVec"])

# StringIndexer for Platform
platform_indexer = StringIndexer(inputCol="Platform", outputCol="PlatformIndex")
platform_encoder = OneHotEncoder(inputCols=["PlatformIndex"], outputCols=["PlatformVec"])
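
To make the indexing and encoding step concrete, here is a minimal pure-Python sketch of what the two stages do under their default settings: `StringIndexer` assigns index 0 to the most frequent label (ties broken alphabetically), and `OneHotEncoder` drops the last category so that it is represented by an all-zero vector. The helper names `string_index` and `one_hot` are hypothetical and only illustrate the idea; Spark itself returns sparse vectors, and the sample uses just the seven Platform values visible in the `show()` output above.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Return a dense one-hot list; with drop_last the final label is all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Platform values from the first seven rows shown earlier (illustration only)
platforms = ["Reddit", "TikTok", "Instagram", "Instagram", "X", "Reddit", "X"]

mapping = string_index(platforms)
encoded = [one_hot(mapping[p], len(mapping)) for p in platforms]

for platform, vec in zip(platforms, encoded):
    print(f"{platform:<10} index={mapping[platform]} vector={vec}")
```

Dropping the last category avoids a redundant column: with an intercept in the linear model, a full set of indicator columns would be perfectly collinear.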

In [23]:
# Standardize Follower Count to zero mean and unit variance with StandardScaler
from pyspark.ml.feature import StandardScaler, VectorAssembler
assembler = VectorAssembler(inputCols=["Follower Count"], outputCol="FollowerCountVec")
scaler = StandardScaler(inputCol="FollowerCountVec", outputCol="ScaledFollowerCount", withStd=True, withMean=True)

# Combine all features
feature_columns = ["Image Complexity", "Text Length", "Posting Hour", "ScaledFollowerCount", "MemeCategoryVec", "PlatformVec"]
final_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
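
As a rough illustration of what `StandardScaler` with `withMean=True` and `withStd=True` computes, the sketch below standardizes a handful of values in plain Python: each value is centered on the mean and divided by the sample standard deviation. The function name `standardize` is hypothetical, and the input is only the seven Follower Count values visible in the `show()` output above; in the actual pipeline the mean and standard deviation are estimated from the training split.

```python
import statistics


def standardize(values):
    """Center on the mean and divide by the sample (n - 1) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Follower Count values from the first seven rows shown earlier (illustration only)
follower_counts = [176299, 545315, 96373, 2258, 453061, 849414, 479100]
scaled = standardize(follower_counts)

for raw, z in zip(follower_counts, scaled):
    print(f"{raw:>7} -> {z:+.3f}")
```

After this transformation Follower Count is on a scale comparable to the other numeric features, which matters mainly for regularized models and for reading coefficient magnitudes.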

In [26]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
# Initialize Linear Regression
regressor = LinearRegression(featuresCol="features", labelCol="Engagement Score")

# Create a pipeline that chains encoding, scaling, assembly, and the regressor
pipeline = Pipeline(stages=[
    category_indexer, category_encoder,
    platform_indexer, platform_encoder,
    assembler, scaler,
    final_assembler, regressor,
])

# Split data into training and test sets
train_data, test_data = clean_data.randomSplit([0.8, 0.2], seed=42)

# Fit the model
meme_data_model = pipeline.fit(train_data)

In [32]:
from pyspark.ml.evaluation import RegressionEvaluator
# Make predictions
predictions = meme_data_model.transform(test_data)

# Evaluate the model
evaluator_rmse = RegressionEvaluator(labelCol="Engagement Score", predictionCol="prediction", metricName="rmse")
evaluator_r2 = RegressionEvaluator(labelCol="Engagement Score", predictionCol="prediction", metricName="r2")

rmse = evaluator_rmse.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"R² Score: {r2}")

Root Mean Squared Error: 19.5161281868429
R² Score: -6.569830206903937e-05


## Summary

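The RMSE of about 19.5 means the model's predictions on the test set typically miss the true Engagement Score by roughly 19.5 points. That is large for a score that ranges from 0 to 100, and it is almost identical to the standard deviation of the score in the cleaned data (about 19.8), which is roughly the error a constant prediction of the mean would produce. The R² of about -0.00007 is effectively zero and marginally negative, so the model does no better than predicting the average Engagement Score for every meme. A linear model on these features therefore fails to capture any meaningful relationship between Engagement Score and inputs such as Image Complexity, Follower Count, or Meme Category.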
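
The comparison with a mean-only baseline can be made concrete with a small pure-Python sketch. It uses the standard definitions RMSE = sqrt(mean squared error) and R² = 1 - SS_res / SS_tot, and only the seven Engagement Score values visible in the `show()` output above, so the numbers are illustrative rather than a reproduction of the reported metrics. The helper names `rmse` and `r2` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Engagement Score values from the first seven rows shown earlier (illustration only)
y_true = [59.93428306, 47.23471398, 62.95377076, 80.46059713,
          45.31693251, 45.31726086, 81.58425631]
mean_value = sum(y_true) / len(y_true)

# Baseline 1: always predict the mean of the evaluated data (R² is 0 by definition)
mean_pred = [mean_value] * len(y_true)
# Baseline 2: a constant that is slightly off the mean (R² becomes negative)
shifted_pred = [mean_value + 1.0] * len(y_true)

print("mean baseline:    rmse =", rmse(y_true, mean_pred), " r2 =", r2(y_true, mean_pred))
print("shifted baseline: rmse =", rmse(y_true, shifted_pred), " r2 =", r2(y_true, shifted_pred))
```

A constant prediction equal to the evaluated data's own mean yields R² of exactly zero and an RMSE equal to the (population) standard deviation of the target. Any predictor that deviates from that mean without tracking the target, for example one whose intercept was fitted on a different split, lands slightly below zero, which is consistent with the tiny negative R² reported above.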