# YouTube Comments Analysis
In this notebook, we work with a dataset of user comments on YouTube videos related to animals or pets. We attempt to identify cat or dog owners from these comments, find the topics that matter to them, and then identify the video creators whose audiences contain the most cat or dog owners.

The dataset consists of comments on videos related to animals and/or pets and is 240 MB compressed.

In [3]:
link: https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing
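command: wget "https://drive.google.com/file/d/1o3DsS3jN_t2Mw3TsV0i7ySRmh9kyYi1a/view?usp=sharing"
note: wget on a Drive "view" link typically saves the HTML preview page rather than the CSV itself, so the file may need to be downloaded through the browser.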

## Data Exploration and Cleaning

In [5]:
df_unclean = spark.read.csv("/FileStore/tables/animals_comments.csv",inferSchema=True,header=True)
df_unclean.show(10)

In [6]:
df_unclean.count() 

In [7]:
from pyspark.sql.functions import isnan, when, count, col, isnull

df_unclean.select([count(when(isnull('comment'), 'comment')).alias('commentIsNull')]).show()
#df_unclean.where(col("comment").isNull()).count()

In [8]:
df_clean = df_unclean.na.drop(subset=["comment"])
df_clean.count()

In [9]:
df_clean.show()

In [10]:
display(df_clean)

creator_name,userid,comment
Doug The Pug,87.0,I shared this to my friends and mom the were lol
Doug The Pug,87.0,Super cute 😀🐕🐶
bulletproof,530.0,stop saying get em youre literally dumb . have some common sense or dont own this kind of dog. fucking retarded I swear
Meu Zoológico,670.0,Tenho uma jiboia e um largato
ojatro,1031.0,I wanna see what happened to the pigs after that please
Tingle Triggers,1212.0,Well shit now Im hungry
Hope For Paws - Official Rescue Channel,1806.0,when I saw the end it said to adopt I saw different animal sites I was mad that they separated the cute little pups after being together for a long time
Hope For Paws - Official Rescue Channel,2036.0,Holy crap. That is quite literally the most adorable pup Ive ever seen.
Life Story,2637.0,武器はクエストで貰えるんじゃないんですか？
Brian Barczyk,2698.0,Call the teddy Larry


In [11]:
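# Label comments whose wording suggests the author owns a dog or cat (1), else 0.
# like() is case-sensitive and matches substrings, so "%my pup%" also covers "my puppy".
from pyspark.sql.functions import when, col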

df_clean = df_clean.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))
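
The labelling rule can be tried out on single comments without a Spark session. The following is a minimal plain-Python sketch of the same rule, assuming that `like("%...%")` behaves as a case-sensitive substring test; `PATTERNS` and `keyword_label` are hypothetical names introduced only for this illustration.

```python
# Hypothetical plain-Python mirror of the keyword rule, for quick experiments.
PATTERNS = [
    "my dog", "I have a dog", "my cat", "I have a cat",
    "my puppy", "my pup", "my kitty", "my pussy",
]

def keyword_label(comment):
    """Return 1 if any pattern occurs as a case-sensitive substring, else 0."""
    if comment is None:
        return 0
    return int(any(p in comment for p in PATTERNS))

print(keyword_label("I love my dog"))          # matches "my dog"
print(keyword_label("My dog is the best"))     # capital "M" does not match
print(keyword_label("I lost my catalog"))      # substring match: a false positive
```

The last two calls show the limits of the rule: capitalised phrases are missed and unrelated words that contain a pattern are labelled as owners, which is one reason a trained classifier is used afterwards.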

In [12]:
df_clean.show()

In [13]:
print("The number of data with label 1:", df_clean.filter(col('label') == 1).count())
print("The number of data with label 0:", df_clean.filter(col('label') == 0).count())

Notice that the labels are very unbalanced in the original dataset and that the pipeline transformation is slow on the whole dataset. So here I took about 70% of the rows with label 1 and about 0.5% of the rows with label 0 for further analysis.

In [15]:
df_clean_tmp1 = df_clean.filter(col('label') == 1).randomSplit([0.7, 0.3], seed = 100)
df_clean_tmp0 = df_clean.filter(col('label') == 0).randomSplit([0.005, 0.995], seed = 100)

df_clean_sub = df_clean_tmp1[0].union(df_clean_tmp0[0])
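
To see why these two fractions give roughly balanced classes, here is a small plain-Python sketch of per-class random sampling. The class sizes are hypothetical (scaled down for speed) and `downsample` is an illustrative helper, not part of the notebook; it approximates what `randomSplit` does by keeping each row independently with a class-specific probability.

```python
import random

def downsample(labels, keep_fraction, seed=100):
    """Keep each row independently with a probability chosen by its label."""
    rng = random.Random(seed)
    return [y for y in labels if rng.random() < keep_fraction[y]]

# Hypothetical class sizes, much smaller than the real dataset.
labels = [1] * 4_000 + [0] * 572_000
sample = downsample(labels, {1: 0.7, 0: 0.005})

n_pos = sum(sample)
n_neg = len(sample) - n_pos
print(n_pos, n_neg)  # roughly 2,800 of each; exact counts depend on the seed
```

If the leftover rows were not needed later, PySpark's `DataFrame.sampleBy("label", fractions={1: 0.7, 0: 0.005}, seed=100)` might be a more compact way to draw such a sample, but `randomSplit` has the advantage of also returning the complement.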

In [16]:
df_clean_sub.count()

The new sample data has 56905 observations.

## Data Preprocessing

In [19]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec

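# Regular-expression tokenizer: split each comment on non-word characters.
# The pattern runs on Java's regex engine, where \W is ASCII-only by default,
# so non-ASCII letters and ideograms act as separators.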
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
word2Vec = Word2Vec(inputCol="words", outputCol="features")
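
To get a feel for the tokens this produces, the sketch below imitates the tokenizer in plain Python. It assumes the `RegexTokenizer` defaults (`gaps=True`, `toLowercase=True`, `minTokenLength=1`) and uses `re.ASCII` to imitate Java's default ASCII-only `\W`; `tokenize` is a hypothetical helper, not a Spark API.

```python
import re

def tokenize(comment):
    """Lower-case, split on ASCII non-word characters, and drop empty tokens."""
    return [t for t in re.split(r"\W", comment.lower(), flags=re.ASCII) if t]

print(tokenize("Super cute!! I'm in love"))  # ['super', 'cute', 'i', 'm', 'in', 'love']
print(tokenize("Meu Zoológico"))             # ['meu', 'zool', 'gico']
print(tokenize("武器は"))                     # []
```

Under these assumptions, comments written in non-Latin scripts end up with no tokens at all, so Word2Vec has nothing to learn from them.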

In [20]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, word2Vec])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df_clean_sub)
dataset = pipelineFit.transform(df_clean_sub)

In [21]:
#(lable0_train,lable0_test) = dataset.filter(col('label') == 1).randomSplit([0.7, 0.3],seed = 100)
#(lable1_train, lable1_ex) = dataset.filter(col('label') == 0).randomSplit([0.005, 0.995],seed = 100)
#(lable1_test, lable1_ex2) = lable1_ex.randomSplit([0.002, 0.998],seed = 100)

(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)

The counts below confirm that the training and test sets have balanced labels.

In [23]:
print("Dataset Count: " + str(dataset.count()))
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))
print("Training Dataset with Label 1 Count: " + str(trainingData.filter(col('label') == 1).count()))
print("Training Dataset with Label 0 Count: " + str(trainingData.filter(col('label') == 0).count()))
print("Test Dataset with Label 1 Count: " + str(testData.filter(col('label') == 1).count()))
print("Test Dataset with Label 0 Count: " + str(testData.filter(col('label') == 0).count()))

## Models

### LogisticRegression

In [26]:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
evaluator = BinaryClassificationEvaluator(rawPredictionCol = "rawPrediction", labelCol = "label")

In [27]:
lr = LogisticRegression(labelCol = "label", featuresCol = "features", maxIter = 10)

model = lr.fit(trainingData)

prediction_lr_train = model.transform(trainingData)
prediction_lr = model.transform(testData)

In [28]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol = "rawPrediction", labelCol = "label")

print("The area under ROC for train set with logistic regression before CV is {}".format(evaluator.evaluate(prediction_lr_train)))
print("The area under ROC for test set with logistic regression before CV is {}".format(evaluator.evaluate(prediction_lr)))
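
The area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; it is an illustration of the metric on made-up scores, not how Spark computes it internally, and its quadratic cost makes it suitable for small inputs only.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
print(roc_auc(labels, scores))  # 8 of the 9 pairs are ranked correctly
```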

#### Parameter Tuning and K-fold cross-validation

In [30]:
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [5, 10, 15])
             .build())

cv = CrossValidator(estimator = lr, estimatorParamMaps = paramGrid, evaluator = evaluator, numFolds = 5)

# Run cross validations
cvModel = cv.fit(trainingData)
prediction_lr_best = cvModel.transform(testData)

print("The area under ROC for test set with logistic regression after CV is {}".format(evaluator.evaluate(prediction_lr_best)))
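
The cost of this search grows with the product of the grid sizes and the number of folds. The following sketch counts the model fits implied by the grid above, assuming that the cross-validator refits the best setting once on the full training set at the end.

```python
from itertools import product

grid = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [5, 10, 15],
}
num_folds = 5

# Every combination of the three parameter lists is one candidate setting.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
fits = len(combos) * num_folds + 1  # +1 for the final refit of the best setting
print(len(combos), fits)  # 27 136
```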

### RandomForest

In [32]:
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)

# Use test set here so we can measure the accuracy of our model on new data
prediction_rf = rfModel.transform(testData)

# Evaluate the untuned model (default hyperparameters) on the test set
print("The area under ROC for test set with random forest before CV is {}".format(evaluator.evaluate(prediction_rf)))

#### Parameter Tuning and K-fold cross-validation

In [34]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 40, 60])
             .addGrid(rf.numTrees, [5, 15, 25])
             .build())

# Create 5-fold CrossValidator
cv = CrossValidator(estimator = rf, estimatorParamMaps = paramGrid, evaluator = evaluator, numFolds = 5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(trainingData)

# Use test set here so we can measure the accuracy of our model on new data
prediction_rf_best = cvModel.transform(testData)

# cvModel uses the best model found from the Cross Validation
# Evaluate best model
print("The area under ROC for test set with random forest after CV is {}".format(evaluator.evaluate(prediction_rf_best)))

### Gradient boosting

In [36]:
# Create an initial gradient-boosted tree model.
gbt = GBTClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
gbtModel = gbt.fit(trainingData)

# Use test set here so we can measure the accuracy of our model on new data
prediction_gbt = gbtModel.transform(testData)

# Evaluate the untuned model (default hyperparameters) on the test set
print("The area under ROC for test set with gradient boosting tree before CV is {}".format(evaluator.evaluate(prediction_gbt)))

#### Parameter Tuning and K-fold cross-validation

In [38]:
gbt = GBTClassifier(labelCol="label", featuresCol="features")

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

# Create 5-fold CrossValidator
cv = CrossValidator(estimator = gbt, estimatorParamMaps = paramGrid, evaluator = evaluator, numFolds = 5)

# Run cross-validation. This can take several minutes, since each fold trains 12 boosted models.
cvModel = cv.fit(trainingData)

# Use test set here so we can measure the accuracy of our model on new data
prediction_gbt_best = cvModel.transform(testData)

# cvModel uses the best model found from the Cross Validation
# Evaluate best model
print("The area under ROC for test set with gradient boosting tree after CV is {}".format(evaluator.evaluate(prediction_gbt_best)))

## Get the best model with best hyper-parameter

In [40]:
import pandas as pd
roc_result = pd.DataFrame({"Logistic Regression": [0.9463, 0.9515], "Random Forest": [0.9252, 0.9253], "Gradient Boost": [0.9383, 0.9525], "Index": ["Before CV", "After CV"]})
roc_result.set_index("Index")

Based on the table above, logistic regression predicts pet owners best before cross-validation. After cross-validation, the gradient-boosted tree has the best AUC, and its AUC improves by about 1.5%. Overall, all three models do a good job: the maximum AUC is 1, and all three are close to 1 even before cross-validation.
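
The size of each improvement can be checked directly from the AUC values in the table. The sketch below computes the relative change per model; `relative_gain` is a helper introduced only for this check.

```python
# AUC before and after cross-validation, copied from the table above.
auc = {
    "Logistic Regression": (0.9463, 0.9515),
    "Random Forest": (0.9252, 0.9253),
    "Gradient Boost": (0.9383, 0.9525),
}

def relative_gain(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before

gains = {name: relative_gain(*pair) for name, pair in auc.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Gradient boosting gains about 1.5%, logistic regression about 0.5%, and random forest barely changes.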

In [42]:
print("The best maximum depth is:", cvModel.bestModel._java_obj.getMaxDepth())
print("The best maximum bins is:", cvModel.bestModel._java_obj.getMaxBins())
print("The best maximum iteration is:", cvModel.bestModel._java_obj.getMaxIter())
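
An alternative that avoids the private `_java_obj` attribute is to pick the parameter map with the highest average cross-validation metric. In PySpark the two inputs would come from `cvModel.avgMetrics` and `cvModel.getEstimatorParamMaps()`, which are listed in the same order; the sketch below shows only the selection logic on hypothetical values, and `best_param_map` is an illustrative helper.

```python
def best_param_map(avg_metrics, param_maps):
    """Return (metric, params) for the highest average cross-validation metric."""
    if len(avg_metrics) != len(param_maps):
        raise ValueError("one metric is expected per parameter map")
    best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return avg_metrics[best_index], param_maps[best_index]

# Hypothetical values for illustration only; not results from this notebook.
metrics = [0.91, 0.95, 0.93]
maps = [
    {"maxDepth": 2, "maxBins": 20, "maxIter": 10},
    {"maxDepth": 6, "maxBins": 20, "maxIter": 20},
    {"maxDepth": 4, "maxBins": 60, "maxIter": 20},
]
print(best_param_map(metrics, maps))
```

Taking the maximum is appropriate here because a larger AUC is better; for an error metric the minimum would be used instead.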

The best hyperparameters for the gradient-boosted tree are maximum depth = 6, maximum bins = 20 and maximum iterations = 20, with all other parameters left at their defaults.

## Apply the model

### Classify All The Users

Gathering the rest of unseen data from the original dataset.

In [47]:
df_clean_rest = df_clean_tmp1[1].union(df_clean_tmp0[1])

In [48]:
print("Rest Dataset Count: " + str(df_clean_rest.count()))
df_clean_rest.show(5)
#print("Rest Dataset with Label 1 Count: " + str(df_clean_rest.filter(col('label') == 1).count()))
#print("Rest Dataset with Label 0 Count: " + str(df_clean_rest.filter(col('label') == 0).count()))

Building a new pipeline that includes the best model

In [50]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec
from pyspark.ml import Pipeline

#Create final pipeline
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")
word2Vec = Word2Vec(inputCol="words", outputCol="features")
gbt_final = GBTClassifier(labelCol="label", featuresCol="features", maxDepth = 6, maxBins = 20, maxIter = 20)

pipeline_final = Pipeline(stages=[regexTokenizer, word2Vec, gbt_final])

# Fit the pipeline to train+test to get final model.
pipelineFit_final = pipeline_final.fit(df_clean_sub)
prediction_final = pipelineFit_final.transform(df_clean_rest)

The model predicts 564468 dog/cat owners, which is 22% of all users. By comparison, keyword matching alone labels only 0.4% of all users as pet owners.

In [52]:
print("Total User: " + str(prediction_final.select('userid').distinct().count()))
print("Predicted Number of Dog/Cat Owner: " + str(prediction_final.filter(col('prediction') == 1).select('userid').distinct().count()))
print("Predicted Number of Non-Dog/Cat Owner: " + str(prediction_final.filter(col('prediction') == 0).select('userid').distinct().count()))

In [53]:
print("Rest Dataset with label 1 Count: " + str(prediction_final.filter(col('label') == 1).select('userid').distinct().count()))
print("Rest Dataset with label 0 Count: " + str(prediction_final.filter(col('label') == 0).select('userid').distinct().count()))

#### 3. Get Insights on Users

In [55]:
display(prediction_final.filter(col('prediction') == 1).select('comment', 'prediction'))

comment,prediction
I need two of these Donald pet toys one of for my cat and one for my dog.,1.0
hey Aron I made a video with my cat! You inspired me !!! I am a big fan of you! Do you have any tips for me? thank you!,1.0
I love the show my cat from hell #mycatfromhell,1.0
my cat died like if you agree,1.0
lol haha! This is so well made! I showed my cat loving cousins this and they died of laughter later on there channel is gonna be “Babycat Vlogs” if you can sub to them it would be greatly appreciated. The channel isn’t yet made but that’s what it will be called.,1.0
I did the same thing on my catI caught her sneaking out and buying drugs off a junkie in the alleyway then she smoked weed and also became the leader of the cat mafia,1.0
Ur cat is savage when i even try to put on a shirt with my cat i will become pirate hook,1.0
So a broad-headed skink got into my house and its hiding somewhere in my room I cant find it; any tips on how to lure it out of wherever its hiding? I dont want my dogs or cats to get it!,1.0
Lucy is so cute. she was looking at Bruce and Dexter to see if they were going to go. my dogs are like that too. lazy bums. hahaha,1.0
Is it normal for my dog to stick her snout on the back edge of the water bowl and lick the water up using the side of the bowl?,1.0


#### 4. Identify Creators With Cat And Dog Owners In The Audience

Brave Wilderness, Brian Barczyk, The Dodo and Taylor Nicole Dean are the four creators with the most comments predicted to come from pet owners.

In [58]:
display(prediction_final.filter(col('prediction') == 1).select('creator_name').groupby('creator_name').count().orderBy('count', ascending=False))

creator_name,count
Brave Wilderness,150315
Brian Barczyk,70197
The Dodo,55844
Taylor Nicole Dean,52698
Robin Seplut,21836
Hope For Paws - Official Rescue Channel,21240
Vet Ranch,18538
Gohan The Husky,17077
Viktor Larkhill,15018
Think Like A Horse,14559


74.26% of creators have at least one predicted cat or dog owner among their commenters.

In [60]:
print("Total Number of Youtube Channel: " + str(prediction_final.select('creator_name').distinct().count()))

In [61]:
print("The Number of Youtube Channel with Dog/Cat Owner Comment: " + str(prediction_final.filter(col('prediction') == 1).select('creator_name').distinct().count()))

In [62]:
prediction_final.createOrReplaceTempView("pred_tmp")

In [63]:
%sql
select distinct p.creator_name, p1.sub_count, pt.total_count, p1.sub_count / pt.total_count as percentage
from pred_tmp as p inner join 
(select creator_name, count(distinct userid) as sub_count from pred_tmp where prediction = 1 group by 1) as p1 
on p.creator_name = p1.creator_name inner join
(select creator_name, count(distinct userid) as total_count from pred_tmp group by 1) as pt
on p.creator_name = pt.creator_name
where p.prediction = 1
order by sub_count desc;

creator_name,sub_count,total_count,percentage
Brave Wilderness,128004,633267,0.2021327496932573
Taylor Nicole Dean,41587,134537,0.309111991496763
The Dodo,39654,164977,0.240360777562933
Brian Barczyk,38834,135460,0.2866824154732024
Hope For Paws - Official Rescue Channel,18762,90601,0.2070838070219975
Vet Ranch,15698,65124,0.2410478471838339
Gohan The Husky,14563,63435,0.2295735792543548
Robin Seplut,12499,46051,0.2714164730407591
ViralHog,11278,70445,0.1600965292071829
Cole & Marmalade,9473,30751,0.308055022600891


In [64]:
%sql
select distinct p.creator_name, p1.sub_count, pt.total_count, p1.sub_count / pt.total_count as percentage
from pred_tmp as p inner join 
(select creator_name, count(distinct userid) as sub_count from pred_tmp where prediction = 1 group by 1) as p1 
on p.creator_name = p1.creator_name inner join
(select creator_name, count(distinct userid) as total_count from pred_tmp group by 1) as pt
on p.creator_name = pt.creator_name
where p.prediction = 1 and pt.total_count > 1000
order by sub_count desc;

creator_name,sub_count,total_count,percentage
Brave Wilderness,128004,633267,0.2021327496932573
Taylor Nicole Dean,41587,134537,0.309111991496763
The Dodo,39654,164977,0.240360777562933
Brian Barczyk,38834,135460,0.2866824154732024
Hope For Paws - Official Rescue Channel,18762,90601,0.2070838070219975
Vet Ranch,15698,65124,0.2410478471838339
Gohan The Husky,14563,63435,0.2295735792543548
Robin Seplut,12499,46051,0.2714164730407591
ViralHog,11278,70445,0.1600965292071829
Cole & Marmalade,9473,30751,0.308055022600891
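
The queries above rank creators by the number of predicted owners. To surface creators whose audience is mostly pet owners, the same per-creator figures can instead be ranked by share. The sketch below shows that logic in plain Python on made-up rows; `owner_share` and the sample data are hypothetical, and sorting by share rather than by `sub_count` is a choice made for this illustration.

```python
from collections import defaultdict

def owner_share(rows, min_users=0):
    """rows: iterable of (creator_name, userid, prediction) tuples."""
    all_users = defaultdict(set)
    owners = defaultdict(set)
    for creator, user, prediction in rows:
        all_users[creator].add(user)
        if prediction == 1:
            owners[creator].add(user)
    # Keep creators with at least one predicted owner and enough distinct users.
    result = [
        (creator, len(owners[creator]), len(users), len(owners[creator]) / len(users))
        for creator, users in all_users.items()
        if owners[creator] and len(users) > min_users
    ]
    return sorted(result, key=lambda r: r[3], reverse=True)

# Made-up rows: creator, user id, predicted label.
rows = [("A", 1, 1), ("A", 1, 0), ("A", 2, 0),
        ("B", 3, 1), ("B", 4, 1), ("C", 5, 0)]
print(owner_share(rows))
```

As in the SQL, a user counts as an owner for a creator if at least one of their comments on that channel is predicted as label 1.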


#### 5. Analysis and Future work

In this project, a dataset of user comments on YouTube videos related to animals or pets, with 5820035 observations, is used to identify cat or dog owners, mainly with the PySpark ML package on Databricks.

In the data processing part, observations with null comments were dropped, leaving 5818984 observations. Key phrases such as "my cat", "my dog" and "my puppy" were then used to give a first label to each observation: 1 for dog/cat owners and 0 for everyone else. After labelling, the training and test sets were generated for model fitting. Because the labels are very unbalanced, I first sampled the label-1 and label-0 rows separately so that the two classes were approximately equal in size, and then combined them before splitting into training and test sets. The data I ended up using has the following label and observation counts:

Dataset Count: 56905
Training Dataset Count: 39958
Test Dataset Count: 16947
Training Dataset with Label 1 Count: 19910
Training Dataset with Label 0 Count: 20048
Test Dataset with Label 1 Count: 8378
Test Dataset with Label 0 Count: 8569
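
The reported counts can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this notebook, including the size of the unseen remainder reported later in this section, and assumes that the final count in the listing above (8569) refers to label 0 in the test set.

```python
counts = {
    "all_rows": 5_820_035, "non_null": 5_818_984,
    "sample": 56_905, "rest": 5_762_079,
    "train": 39_958, "test": 16_947,
    "train_pos": 19_910, "train_neg": 20_048,
    "test_pos": 8_378, "test_neg": 8_569,  # assumed to be the label-0 count
}

# Each split should add up to the set it was drawn from.
assert counts["train"] + counts["test"] == counts["sample"]
assert counts["train_pos"] + counts["train_neg"] == counts["train"]
assert counts["test_pos"] + counts["test_neg"] == counts["test"]
assert counts["sample"] + counts["rest"] == counts["non_null"]

dropped = counts["all_rows"] - counts["non_null"]
pos_share = (counts["train_pos"] + counts["test_pos"]) / counts["sample"]
print(dropped, round(pos_share, 3))  # 1051 0.497
```

All four sums agree, 1051 rows had null comments, and just under half of the sampled rows carry label 1.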

With the sample data ready, I tokenized and vectorized the comments and then tried three models to classify pet owners: logistic regression, random forest and gradient-boosted trees. For each model, I performed 5-fold cross-validation and used AUC to evaluate the classifier. The results are shown below:

               Gradient Boost  Logistic Regression  Random Forest
    Index                                                        
    Before CV          0.9383               0.9463         0.9252
    After CV           0.9525               0.9515         0.9253

Based on the table above, logistic regression predicts pet owners best before cross-validation. After cross-validation, the gradient-boosted tree has the best AUC, and its AUC improves by about 1.5%. Overall, all three models do a good job: the maximum AUC is 1, and all three are close to 1 even before cross-validation.

The best hyperparameters for the gradient-boosted tree are maximum depth = 6, maximum bins = 20 and maximum iterations = 20, with all other parameters left at their defaults.

With the best model selected, I built a new pipeline that includes it, fitted the pipeline on all the data used for model training, and then applied it to the rest of the unseen data, which has 5762079 observations.

The model predicts 564468 dog/cat owners, which is 22% of all users. By comparison, keyword matching alone labels only 0.4% of all users as pet owners.

Reading some of the comments from predicted cat/dog owners, I found that most of the videos they watch are purely entertaining and that most of the audience express their feelings about the video. A sentiment analysis of the comments could therefore be done as a next step. (I am having a hard time running this on Spark because it is quite slow, but I am still trying.)

As for the YouTubers, 74.26% of creators have cat or dog owners in their audience. Brave Wilderness, Brian Barczyk, The Dodo and Taylor Nicole Dean are the four creators with the largest numbers of predicted pet owners in their audience, and all four have a relatively high number of subscribers (over 1M). Because their audiences are so large, they are also more likely to include pet owners than YouTubers with smaller audiences. It is therefore not surprising that the share of pet owners for these YouTubers is around 20% to 30%. I then restricted the analysis to creators with more than 1000 distinct commenters (treated as active YouTubers). Among these, StormyRabbits, wingsNpaws, Menthol, Kratom, RaleighLink14 and Think Like A Horse have approximately 50% of their audience predicted as pet owners.

Based on the analysis above, some recommendations can be made.
1. YouTube Video Recommendation ("You may like" section): After sentiment analysis of the comments, YouTube can learn the audience's preferences in more detail, since they are expressed in words. Combined with a user's play history, this information can make video recommendations more precise.
2. Ads Recommendation: Classifying the audience with different labels can help YouTube target ads better. For example, a user predicted to be a pet owner may be more likely to click on ads for pet supplies, especially on channels whose audience consists mostly of pet owners. The ads could also appear on Google search or Gmail pages to increase the click rate.