# Predicting D&D Campaign YouTube Views

I have been building a dataset of TTRPG campaign playlists from YouTube and podcast feeds. The aim of this project is to build a model that takes the parameters of a TTRPG show and predicts the number of views it could receive. A secondary aim is to identify whether features such as the game system, the number of players, or the age of the channel fail to correlate with views, and so need not concern people starting a new channel.

For this I created three sets of features: a TF-IDF vector built from the keywords returned by YouTube's API, a one-hot encoding of the game system, and a set of numeric columns. For the one-hot encoding, each game system is first mapped to an integer index (for instance, Dungeons and Dragons might be 0, Pathfinder 1, Blades in the Dark 2, and so on), and that index is then expanded into a binary indicator vector. In total this was the data used for the model:

#### Data

| Data Field              | Data Type | Description |
|-------------------------|-----------|:------------|
| Genre Keywords          | String[]  | Genre keywords are stored as a text string, with each keyword surrounded by double quotes and separated by commas. Ex. ["fantasy", "tabletop", "gaming"] |
| Game System Type        | String    | The name of the game system used. Ex. Dungeons and Dragons, Pathfinder, Delta Green, etc. |
| Channel Age             | Integer   | The number of days between the date the channel was created and 11/10/2024. |
| Campaign Age            | Integer   | The number of days between the date the campaign playlist was created and 11/10/2024. |
| Episode Count           | Integer   | The number of episodes in the campaign, based on the number of videos attached to the playlist. |
| Combined Episode Length | Integer   | The combined duration of all episodes, in seconds. Each video length is first converted to seconds, then the values are totaled. |
| Average Episode Length  | Float     | The average episode duration in the playlist, in seconds. This is the combined episode length divided by the number of episodes. |
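
The two length fields in the table can be derived as in the following minimal sketch. It assumes video durations arrive as ISO 8601 strings such as `PT1H2M3S`, which is the format the YouTube Data API uses for `contentDetails.duration`. The function names and the sample durations are hypothetical and are not taken from the project's pipeline.

```python
import re

# Assumption: durations are ISO 8601 strings such as "PT1H2M3S" or "P1DT2H".
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration: str) -> int:
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION.match(iso_duration)
    if match is None:
        raise ValueError(f"Unrecognised duration: {iso_duration!r}")
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def episode_length_features(durations):
    """Return (episode_count, combined_length_s, average_length_s)."""
    total = 0
    for duration in durations:
        total += duration_to_seconds(duration)
    count = len(durations)
    average = total / count if count else 0.0
    return count, total, average


# Hypothetical playlist: a 4 hour episode, a 30 minute one, and 1h 2m 3s.
episode_count, tot_video_length, avg_video_length = episode_length_features(
    ["PT4H", "PT30M", "PT1H2M3S"]
)
```

Keeping both the total and the average lets the model distinguish a long campaign of short episodes from a short campaign of long ones.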

### Goal

I am going to use these data points to train a model that predicts one of:
- total campaign views, or
- average video views

What I hope to determine is which parameters a new TTRPG gaming group could choose at the outset to improve its chances of decent view counts. My initial assumption is that campaigns using Dungeons and Dragons have a good chance of higher view counts. I also expect campaigns with genre keywords like "comedy" to have higher view counts than those tagged "horror" or "political intrigue". Episode length varies widely too: some campaigns run very long episodes of around 4 hours, while others trim extra content and publish episodes as short as 30 minutes. Lastly, I predict that the age of the channel or the age of the campaign will be the biggest predictor of view counts, simply because older content has had longer to accumulate views.

There was some available data I kept out of this model. I didn't include YouTube engagement information such as subscriber count, like count, or comment count. My problem statement is to recommend parameters that the group can control, and subscriber and like counts are outside of their control. I do wonder, however, whether subscriber count is a strong predictor of view counts and is distinct enough from channel and campaign age to make the model more accurate. I also tried to include the number of players in a campaign. This proved very difficult to add to the dataset. I tried a number of methods to determine it, including a named entity recognition model built with the spaCy library, using my own spans and tagging methods to count player names. Eventually I would like to add player demographics and see whether they play a part in a campaign's performance. For example, are people more likely to watch a campaign whose cast is diverse in gender and includes LGBT and POC players?

### Problems

A glaring problem with the current state of this project is the size of the dataset. I tried to create methods to identify and tag each playlist automatically. For instance, I tried to use a named entity recognition model in spaCy to find campaigns that mention Dungeons and Dragons or Pathfinder in their descriptions. The problem is that the generic spaCy language model for English isn't trained to recognize all TTRPG game systems under its "PRODUCT" entity type. I did create a method for tagging descriptions, but I ran out of time to tag enough of them to train a model that predicts the game system for every description. So at this time I only have 11 campaigns tagged with the game system used.

```mermaid
graph TB

subgraph youtube[YouTube API]
    A>Playlist Search Endpoint]
    C>Playlist Items Endpoint]
    F>Video Stats Endpoint]
end

subgraph dw[Data Warehouse]
    direction TB
    B[(Playlist Store)]
    D[(Playlist Item Store)]
    E[(Video Stats Store)]
    GS[(Game System Records)]
    PL[(Player Records)]
end

A -->|Get Playlist Metadata| P[[Playlist Puller]]
P --> B
C -->|Retrieve Playlist Items| PI[[Playlist Item Puller]]
B --> PI
PI --> D
F -->|Update Video Statistics| VS[[Video Stats Puller]]
D--> VS
VS ----> E

B -->|channel_date,\n playlist_date| data
D -->|video_name,\n video_keywords,\n episode_count| data
E -->|view_count,\n video_length| data
M[\Manual Input/] -.- MAPI{{MyTableTopList API}}
MAPI -.- GS
MAPI -.- PL
GS -->|game_system_name| data
PL -.-|player_count| data 

data[\Data\] --> model[[Views Predictor Model]]

subgraph legend
    direction LR
    Y[(Database Table)] 
    Z[[Airflow DAG]]
    X{{API}}
end

style youtube fill:#f66
style M fill:#0f0
style legend fill:#c0b3ec
style dw fill:#20d4ce
```

### Library Imports

In [6]:
# Imports
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import (
    row_number,
    col,
    when,
    desc,
    udf,
    concat_ws,
    explode,
    split,
    count,
    regexp_replace,
    size,
    sum,
    expr,
    first,
    avg,
    variance,
    concat,
    lit,
    lower)
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType, DoubleType
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

from pathlib import Path

### Creating Spark Context

In [2]:
# Set up a local Spark session with generous memory and timeout settings.
spark = SparkSession \
    .builder \
    .appName("YouTube Views Prediction Model") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.driver.maxResultSize", "8g") \
    .config("spark.network.timeout", 10000000) \
    .config("spark.executor.heartbeatInterval", 10000000) \
    .config("spark.storage.blockManagerSlaveTimeoutMs", 10000000) \
    .config("spark.executor.memory", "10g") \
    .master("local") \
    .getOrCreate()

24/11/11 07:33:11 WARN Utils: Your hostname, geoff-workstation resolves to a loopback address: 127.0.1.1; using 192.168.1.47 instead (on interface eno1)
24/11/11 07:33:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/11 07:33:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/11/11 07:33:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Getting data from csv file 

In [7]:
# Create a DataFrame of the campaigns and associated data from the CSV export
campaign_schema = StructType([
    StructField('playlist_id', StringType(), True),
    StructField('campaign_uuid', StringType(), True),
    StructField('campaign_name', StringType(), True),
    StructField('campaign_age', IntegerType(), True),
    StructField('channel_age', IntegerType(), True),
    StructField('genre', StringType(), True),
    StructField('game_system', StringType(), True),
    StructField('episode_count', IntegerType(), True),
    StructField('tot_video_length', IntegerType(), True),
    StructField('avg_video_length', DoubleType(), True),
    StructField('avg_view_count', DoubleType(), True),
    StructField('total_view_count', IntegerType(), True),
])
campaigns_data = spark.read.csv('campaigns_data.csv', schema=campaign_schema)
campaigns_data.show(20, truncate=False)


### Creating Feature Vectors

Here we split the columns from the dataframe into our numerical, categorical, and textual feature vectors.  

#### Textual Feature Vector

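To use the genre keywords in the model as simply as possible, I turned the keywords field into a TF-IDF vector. TF-IDF stands for term frequency-inverse document frequency: a keyword's weight grows with how often it appears in a campaign's keyword list and shrinks with the number of campaigns that use it.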

In [None]:
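# Clean the keyword column: strip brackets and quotes, lowercase, then split on commas.
# The commas must survive the cleanup step, otherwise split() returns a single token.
campaigns_data = campaigns_data.withColumn(
    "genre_set",
    split(lower(regexp_replace(col("genre"), "[\\[\\]'\"]", "")), ",\\s*")
)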

In [None]:
# Hash and transform the text data
hashing_tf = HashingTF(inputCol="genre_set", outputCol="genre_feature", numFeatures=1000)
tf_data = hashing_tf.transform(campaigns_data)

# Then do an IDF
idf = IDF(inputCol="genre_feature", outputCol="genre_tfidf")
idf_model = idf.fit(tf_data)
tfidf_data = idf_model.transform(tf_data)
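
To make the weighting concrete, here is a plain-Python sketch of the same idea. It assumes the smoothed formula given in Spark's MLlib documentation, `idf = log((N + 1) / (df + 1))`, and it keeps terms by name instead of hashing them into 1000 buckets as `HashingTF` does, so it ignores possible hash collisions. The keyword lists are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (a list of keyword tokens)."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))

    # Smoothed inverse document frequency (assumed to match Spark's IDF).
    idf = {term: math.log((n_docs + 1) / (df + 1))
           for term, df in doc_freq.items()}

    # Term frequency is the raw count of the term within the document.
    return [{term: tf * idf[term] for term, tf in Counter(terms).items()}
            for terms in documents]


# Hypothetical keyword lists for three campaigns.
campaign_keywords = [
    ["fantasy", "tabletop", "comedy"],
    ["fantasy", "tabletop", "horror"],
    ["fantasy", "political intrigue"],
]
weights = tfidf(campaign_keywords)
```

In this toy example "fantasy" appears in every campaign and so receives a weight of zero, while the rarer "comedy" receives the largest weight in the first campaign. Keywords shared by nearly every TTRPG playlist would likewise contribute little to the model.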

#### Categorical Feature
The game system is one-hot encoded: each game system name is first mapped to a numeric index, and that index is then expanded into a binary vector.

In [None]:
# Create the categorical feature from the data
indexer = StringIndexer(inputCol="game_system", outputCol="game_system_index")
indexed_data = indexer.fit(tfidf_data).transform(tfidf_data)

game_system_encoder = OneHotEncoder(inputCol="game_system_index", outputCol="game_system_onehot")
gs_encoded = game_system_encoder.fit(indexed_data).transform(indexed_data)
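
The difference between the index and the one-hot vector can be sketched in plain Python. This assumes Spark's documented defaults: `StringIndexer` orders labels by descending frequency (ties broken alphabetically) starting at 0, and `OneHotEncoder` uses `dropLast=True`, so the last category is represented by an all-zero vector. The helper names and sample values are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: position for position, name in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Expand an index into a binary vector, optionally dropping the last slot."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical game system column for four campaigns.
game_systems = [
    "Dungeons and Dragons",
    "Pathfinder",
    "Dungeons and Dragons",
    "Blades in the Dark",
]
index = string_index(game_systems)
encoded = {name: one_hot(position, len(index))
           for name, position in index.items()}
```

The index alone would imply an ordering between game systems (Pathfinder "greater than" Dungeons and Dragons), which is meaningless. The one-hot vector removes that ordering.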

#### Numerical Features

With Spark, the numeric columns need no further preparation here: we only list their names so that the VectorAssembler can combine them with the other features.

In [None]:
# Numerical features (names must match the columns defined in campaign_schema)
numerical_cols = ["channel_age", "episode_count", "tot_video_length", "avg_video_length"]

### Combining Vectors

In [None]:
assembler = VectorAssembler(
    inputCols=numerical_cols + ["genre_tfidf", "game_system_onehot"],
    outputCol="features"
)
final_data = assembler.transform(gs_encoded)
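
Conceptually, the assembler concatenates its inputs in the order listed. The sketch below uses plain lists in place of Spark vectors to show the resulting layout and width. The row values and the count of three game systems are hypothetical.

```python
def assemble(numeric_values, tfidf_vector, one_hot_vector):
    """Concatenate the three feature groups in the order given to the assembler."""
    return list(numeric_values) + list(tfidf_vector) + list(one_hot_vector)


def assembled_width(n_numeric, n_hash_buckets, n_game_systems, drop_last=True):
    """Number of columns in the assembled feature vector."""
    one_hot_width = n_game_systems - 1 if drop_last else n_game_systems
    return n_numeric + n_hash_buckets + one_hot_width


# Hypothetical campaign: 4 numeric columns, 1000 hash buckets, 3 game systems.
numeric = [2900, 45, 540000, 12000.0]
tfidf_vector = [0.0] * 1000
tfidf_vector[17] = 0.69  # one non-zero keyword weight at an arbitrary bucket
game_system = [1.0, 0.0]

row = assemble(numeric, tfidf_vector, game_system)
width = assembled_width(n_numeric=4, n_hash_buckets=1000, n_game_systems=3)
```

With `numFeatures=1000`, the assembled vector has over a thousand columns, most of them zero. That is worth keeping in mind given how few tagged campaigns are currently available.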

## Train Test Split

My aim is to use an 80% train size and a 20% test size. 

In [None]:
# Test size parameter
t_size = 0.2
seed = 1123

# Split the assembled data into training and test DataFrames
train_df, test_df = final_data.randomSplit(weights=[1-t_size,t_size], seed=seed)
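
One caveat: `randomSplit` treats the weights as probabilities, not exact proportions, so the split sizes are only approximately 80/20. With 11 campaigns the test set can easily end up with one row, or four. The simulation below illustrates this, assuming each row is assigned independently. It uses Python's `random` module, not Spark's sampler, so it will not reproduce Spark's actual split for a given seed. The `exact_split` helper is a hypothetical alternative that fixes the test size.

```python
import random


def bernoulli_split(n_rows, test_fraction, seed):
    """Assign each row to the test set independently with the given probability."""
    rng = random.Random(seed)
    train, test = [], []
    for row in range(n_rows):
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


def exact_split(n_rows, test_fraction, seed):
    """Shuffle the row ids and cut off a fixed-size test set (at least one row)."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_test = max(1, round(n_rows * test_fraction))
    return rows[n_test:], rows[:n_test]


# Test-set sizes over 200 simulated splits of 11 rows.
test_sizes = [len(bernoulli_split(11, 0.2, seed)[1]) for seed in range(200)]

# A fixed-size split of the same 11 rows: 9 for training, 2 for testing.
train_rows, test_rows = exact_split(11, 0.2, seed=1123)
```

On a dataset this small, a single split gives a noisy estimate either way, so the resulting error figure should be read with caution.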

### Dependent Y Variable

I kept this as a variable so that I can switch the dependent variable between the average view count per video in a campaign and the campaign's total view count.

In [None]:
# Choose the dependent variable (label) column
# dependent_col = "total_view_count"
dependent_col = "avg_view_count"

## Random Forest Regressor Model

In [None]:
# Create a Random Forest regressor
rf_regressor = RandomForestRegressor(featuresCol="features", labelCol=dependent_col)

# Fit the model on the training set
rf_model = rf_regressor.fit(train_df)

# Make predictions on the test set
predictions = rf_model.transform(test_df)
predictions.show(truncate=False)

In [None]:
# Evaluate the model
evaluator = RegressionEvaluator(labelCol=dependent_col, predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse}")
# Show the actual and predicted values side by side
predictions.select(dependent_col, "prediction").show(truncate=False)
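
RMSE is expressed in the same units as the label (views), so it is easiest to interpret next to a simple baseline. The sketch below computes RMSE by hand and compares a model against always predicting the mean training label. All numbers are hypothetical and are not results from this project.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    # math.fsum is used because the notebook's pyspark import shadows
    # Python's built-in sum.
    return math.sqrt(math.fsum(squared_errors) / len(squared_errors))


def mean_baseline_rmse(train_labels, test_labels):
    """RMSE of always predicting the mean of the training labels."""
    mean_label = math.fsum(train_labels) / len(train_labels)
    return rmse(test_labels, [mean_label] * len(test_labels))


# Hypothetical average view counts.
train_labels = [1000.0, 3000.0, 5000.0]
test_labels = [2000.0, 4000.0]
model_predictions = [2500.0, 3500.0]

baseline_error = mean_baseline_rmse(train_labels, test_labels)
model_error = rmse(test_labels, model_predictions)
```

A model whose RMSE is not clearly below the mean baseline has not learned anything useful from the features.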