## YouTube Data Analysis

In this notebook, I work with a dataset of user comments on YouTube videos related to animals or pets. I will attempt to identify cat or dog owners based on these comments, find the topics that are important to them, and then identify the video creators whose audiences have the highest fraction of cat or dog owners, so that those creators can be recommended to pet owners.

Step 1: Identify cat and dog owners, that is, find the users who are cat and/or dog owners.

Step 2: Build and evaluate classifiers for the cat and dog owners and measure the performance of the classifiers.

Step 3: Apply the best classifier to all the users in the dataset. Estimate the fraction of all users who are cat/dog owners.

Step 4: Extract insights about cat and dog owners by finding the topics important to them.

Step 5: Identify creators with pet owners in the audience. Find creators whose videos are commented on by the most cat and/or dog owners, in other words, creators with the highest statistically significant percentages of cat and/or dog owners.

## Step 1: Identify cat and dog owners and find the users who are cat and/or dog owners.

#### 1.1. Data Exploration and Cleaning

In [5]:
df_clean=spark.read.csv("/FileStore/tables/animals_comments.csv",inferSchema=True,header=True)
display(df_clean.head(10))

creator_name,userid,comment
Doug The Pug,87.0,I shared this to my friends and mom the were lol
Doug The Pug,87.0,Super cute 😀🐕🐶
bulletproof,530.0,stop saying get em youre literally dumb . have some common sense or dont own this kind of dog. fucking retarded I swear
Meu Zoológico,670.0,Tenho uma jiboia e um largato
ojatro,1031.0,I wanna see what happened to the pigs after that please
Tingle Triggers,1212.0,Well shit now Im hungry
Hope For Paws - Official Rescue Channel,1806.0,when I saw the end it said to adopt I saw different animal sites I was mad that they separated the cute little pups after being together for a long time
Hope For Paws - Official Rescue Channel,2036.0,Holy crap. That is quite literally the most adorable pup Ive ever seen.
Life Story,2637.0,武器はクエストで貰えるんじゃないんですか？
Brian Barczyk,2698.0,Call the teddy Larry


In [6]:
df_clean.count() 

In [7]:
df_clean = df_clean.na.drop(subset=["comment"])
df_clean = df_clean.dropDuplicates()

##### Filter out non-English comments

In [9]:
import nltk
nltk.download('stopwords')

ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(nltk.corpus.stopwords.words()) - ENGLISH_STOPWORDS
 
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}

def is_english(text):
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)
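
To make the heuristic concrete, here is a minimal, dependency-free sketch of the same stopword-overlap idea. The two tiny stopword sets are hand-picked stand-ins (assumptions for illustration), not the NLTK lists used above, and the regex tokenizer only approximates `nltk.wordpunct_tokenize`.

```python
import re

# Hand-picked stand-ins for the NLTK stopword lists (illustrative only).
TOY_ENGLISH = {"i", "the", "to", "my", "and", "a", "is", "this"}
TOY_NON_ENGLISH = {"uma", "um", "e", "der", "und", "le", "la"}

def is_english_sketch(text):
    """Return True when more English than non-English stopwords occur."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & TOY_ENGLISH) > len(words & TOY_NON_ENGLISH)

print(is_english_sketch("I shared this to my friends and mom"))
print(is_english_sketch("Tenho uma jiboia e um largato"))
print(is_english_sketch("Super cute"))
```

Because the comparison is strict, a short comment that contains no stopwords at all (such as "Super cute") is treated as non-English and dropped, which is worth keeping in mind when reading the filtered counts.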

In [10]:
# from langdetect import detect
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, BooleanType, DoubleType 

classify_lang = udf(lambda x: is_english(x), BooleanType())

In [11]:
df_w_lang = df_clean.withColumn('is_english', classify_lang('comment'))

df_only_en = df_w_lang.filter(df_w_lang.is_english==True)
df_only_en = df_only_en.select('creator_name', 'userid', 'comment')
df_only_en.count()

In [12]:
display(df_only_en.take(10))

creator_name,userid,comment
Gohan The Husky,190661.0,From Germany! Love u r videos and Gohan so much 😍
Hope For Paws - Official Rescue Channel,1806.0,when I saw the end it said to adopt I saw different animal sites I was mad that they separated the cute little pups after being together for a long time
Brian Barczyk,2698.0,Call the teddy Larry
Taylor Nicole Dean,2432708.0,I am interested in getting a Hedgehog. So it would be so helpful if you could make a more updated video about how to take care of a hedgehog.
The Dodo,1424576.0,Really? You have to steal other peoples content to get views? This is just sad. One link and not even to her YouTube. Also at least have the DECENCY to actually ask if you can use it!
Hope For Paws - Official Rescue Channel,190054.0,Kinda sad that their mother isnt still found
Talking Kitty Cat,2911.0,steve: No wet food for a month!:cats immediately stop fighting:
Robin Seplut,294406.0,Aww! What a cute and fluffy cat❤️
Hope For Paws - Official Rescue Channel,2036.0,Holy crap. That is quite literally the most adorable pup Ive ever seen.
Hope For Paws - Official Rescue Channel,2911.0,Its people like Hope For Paws who truly make the world a better place <3


##### Use VADER sentiment analysis to get a sentiment score (positive: score>=0.05, neutral: -0.05<score<0.05, negative: score<=-0.05)

In [14]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
sentiment = udf(lambda x: sid.polarity_scores(x)['compound'], DoubleType())

df_w_comment_score = df_only_en.withColumn('sentiment_score', sentiment('comment'))
# print(sid.polarity_scores('From Germany! Love u r videos and Gohan so much'))

In [15]:
def score_to_class(score):
  if score>=0.05:
    return 'Positive'
  elif -0.05<score<0.05:
    return 'neutral'
  else:
    return 'negative'

sentiment_classify = udf(lambda x: score_to_class(x), StringType())
df_w_comment = df_w_comment_score.withColumn('sentiment', sentiment_classify('sentiment_score'))
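
As a quick boundary check, the thresholds can be exercised without Spark. The sketch below restates the same three-way rule as a plain function (named `classify_score` here only to keep it separate from the UDF) and probes the edge values; the sample scores are taken from the output table or chosen as edge cases.

```python
def classify_score(score):
    # Same thresholds as the cell above; the else branch covers score <= -0.05.
    if score >= 0.05:
        return 'Positive'
    elif -0.05 < score < 0.05:
        return 'neutral'
    else:
        return 'negative'

for s in (0.6696, 0.05, 0.0, -0.05, -0.7877):
    print(s, classify_score(s))
```

Both boundary values fall outside the neutral band: 0.05 is labelled Positive and -0.05 is labelled negative.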

In [16]:
display(df_w_comment.head(10))

creator_name,userid,comment,sentiment_score,sentiment
Gohan The Husky,190661.0,From Germany! Love u r videos and Gohan so much 😍,0.6696,Positive
Hope For Paws - Official Rescue Channel,1806.0,when I saw the end it said to adopt I saw different animal sites I was mad that they separated the cute little pups after being together for a long time,0.128,Positive
Brian Barczyk,2698.0,Call the teddy Larry,0.0,neutral
Taylor Nicole Dean,2432708.0,I am interested in getting a Hedgehog. So it would be so helpful if you could make a more updated video about how to take care of a hedgehog.,0.8596,Positive
The Dodo,1424576.0,Really? You have to steal other peoples content to get views? This is just sad. One link and not even to her YouTube. Also at least have the DECENCY to actually ask if you can use it!,-0.7877,negative
Hope For Paws - Official Rescue Channel,190054.0,Kinda sad that their mother isnt still found,-0.4228,negative
Talking Kitty Cat,2911.0,steve: No wet food for a month!:cats immediately stop fighting:,-0.7345,negative
Robin Seplut,294406.0,Aww! What a cute and fluffy cat❤️,0.5093,Positive
Hope For Paws - Official Rescue Channel,2036.0,Holy crap. That is quite literally the most adorable pup Ive ever seen.,0.2247,Positive
Hope For Paws - Official Rescue Channel,2911.0,Its people like Hope For Paws who truly make the world a better place <3,0.9201,Positive


In [17]:
# label comments that suggest the user owns a dog or a cat
from pyspark.sql.functions import when
from pyspark.sql.functions import col

# you can use your own way to extract the label

df_classify_user = df_w_comment.withColumn("label", \
                           (when(col("comment").like("%my dog%"), 1) \
                           .when(col("comment").like("%I have a dog%"), 1) \
                           .when(col("comment").like("%my cat%"), 1) \
                           .when(col("comment").like("%I have a cat%"), 1) \
                           .when(col("comment").like("%my puppy%"), 1) \
                           .when(col("comment").like("%my pup%"), 1) \
                           .when(col("comment").like("%my kitty%"), 1) \
                           .when(col("comment").like("%my pussy%"), 1) \
                           .otherwise(0)))

# (large,small)=df_clean.filter(col('label')==1).randomSplit([0.99, 0.01],seed = 100)
# df_clean = small
df_classify_user.createOrReplaceTempView('iden_owner')

In [18]:
%sql

Select label,count(*)
From iden_owner
Group by 1


label,count(1)
1,39265
0,4234953
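
The `like` patterns are case-sensitive substring matches, which affects which comments receive label 1. The sketch below mimics that behaviour in plain Python; `label_ci` is a hypothetical case-insensitive variant added for comparison and is not part of the notebook.

```python
PATTERNS = ["my dog", "I have a dog", "my cat", "I have a cat",
            "my puppy", "my pup", "my kitty", "my pussy"]

def label_like(comment):
    """Mimic the chained LIKE '%...%' checks: case-sensitive substring match."""
    return int(any(p in comment for p in PATTERNS))

def label_ci(comment):
    """Hypothetical variant: lower-case both sides before matching."""
    text = comment.lower()
    return int(any(p.lower() in text for p in PATTERNS))

for c in ("not on my dogs watch", "My dog could grab his water bottle",
          "Aww! What a cute and fluffy cat"):
    print(c, label_like(c), label_ci(c))
```

A comment that starts with "My dog" is missed by the case-sensitive rule, while "my dogs" still matches because "my dog" is a substring of it.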


#### 1.2 Data preprocessing: generate numeric feature vectors and build the training and test data

In [20]:
from pyspark.ml.feature import RegexTokenizer, Word2Vec, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression

# regular expression tokenizer
regexTokenizer = RegexTokenizer(inputCol="comment", outputCol="words", pattern="\\W")

remover = StopWordsRemover(inputCol="words", outputCol="filtered")

# countvec = CountVectorizer(inputCol='filtered', outputCol='features')
word2Vec = Word2Vec(inputCol="filtered", outputCol="features",vectorSize=20)

In [21]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, remover, word2Vec])

# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df_classify_user)
dataset = pipelineFit.transform(df_classify_user)
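
Spark's `Word2Vec` model represents each document as the average of its word vectors, which is why every comment ends up as a single 20-dimensional `features` vector. The sketch below illustrates that averaging with made-up 3-dimensional embeddings; skipping tokens without an embedding is a simplification of this sketch and is not claimed to match Spark's handling of out-of-vocabulary words.

```python
def average_vectors(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    size = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Toy 3-dimensional embeddings (made-up numbers; the notebook uses vectorSize=20).
toy_embeddings = {
    "dog": [0.2, -0.4, 0.1],
    "feel": [0.0, 0.2, -0.3],
    "put": [0.1, 0.2, 0.5],
}
doc_vector = average_vectors(["feel", "dog", "put"], toy_embeddings)
print(doc_vector)
```

Averaging discards word order, so two comments with the same filtered tokens in a different order get the same feature vector.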

In [22]:
cate_num = dataset.groupBy("label").count()
cate_num.show()

In [23]:
# label 1 (pet owners): 70/30 train/test split
(label1_train, label1_test)=dataset.filter(col('label')==1).randomSplit([0.7, 0.3],seed = 100)
# label 0: down-sample so that each set is roughly balanced with label 1
(label0_train, label0_ex)=dataset.filter(col('label')==0).randomSplit([0.0065, 0.9935],seed = 100)
(label0_test, label0_ex2)=label0_ex.randomSplit([0.003, 0.997],seed = 100)

In [24]:
trainingData = label1_train.union(label0_train)
testData=label1_test.union(label0_test)

In [25]:
# trainingData.write.parquet("dbfs:/Filestore/training.parquet")

In [26]:
display(trainingData.head(10))

creator_name,userid,comment,sentiment_score,label,words,filtered,features
Aarons Animals,2285699.0,Never gonna happen not on my dogs watch you filthy communist felines,0.0,1,"List(never, gonna, happen, not, on, my, dogs, watch, you, filthy, communist, felines)","List(never, gonna, happen, dogs, watch, filthy, communist, felines)","List(1, 20, List(), List(0.2503337822854519, 0.18225614447146654, -0.20774291269481182, -0.03516801632940769, -0.11782620957819745, -0.05884992051869631, 0.17715225648134947, 0.04842398362234235, -0.2797582296188921, -0.1657107884529978, 0.0012020301073789597, 0.04911849740892649, 0.004935234959702939, 0.03571598790585995, -0.043676303466781974, 0.09205120848491788, 0.10440967440081295, 0.15981378080323339, -0.19839006755501032, -0.0180969103530515))"
Alex Knappenberger,517292.0,Kid thinks he is cool: My dog could grab his water bottle and drink it out of there how about your dog? Me: my dog!? Well he could .... Umm....... 0:10 .....yeah um my dog is just a normal dog 😐,0.6189,1,"List(kid, thinks, he, is, cool, my, dog, could, grab, his, water, bottle, and, drink, it, out, of, there, how, about, your, dog, me, my, dog, well, he, could, umm, 0, 10, yeah, um, my, dog, is, just, a, normal, dog)","List(kid, thinks, cool, dog, grab, water, bottle, drink, dog, dog, well, umm, 0, 10, yeah, um, dog, normal, dog)","List(1, 20, List(), List(0.16871092593493428, -0.023978168438923985, -0.0877951397315452, 0.12337067386282509, -0.11963025603051247, 0.0819256623324595, 0.04887496969221454, -0.04836835398485786, 0.07178512686177303, -0.3980460970809585, 0.07406207296605173, -0.09226672684675769, 0.16203258696355316, -0.005787633163364309, 0.10812565488250632, -0.04319395634688829, 0.08781321890848247, 0.25075001544074005, -0.07439040598508559, 0.09059439266198559))"
Brave Wilderness,212789.0,The crab is cuter than my dog and when my mom says wanna take the dog for a walk im like dog where? all I see is a beached whale,0.7003,1,"List(the, crab, is, cuter, than, my, dog, and, when, my, mom, says, wanna, take, the, dog, for, a, walk, im, like, dog, where, all, i, see, is, a, beached, whale)","List(crab, cuter, dog, mom, says, wanna, take, dog, walk, im, like, dog, see, beached, whale)","List(1, 20, List(), List(0.13986005022500952, -0.03907567262649536, -0.13026724041750035, 0.14239789785351603, -0.2743004396829444, 0.029896076533865803, 0.019119252761205036, -0.003630560946961244, -0.08519598341857393, -0.25708630892137685, 0.06175308053692182, -0.038810342798630396, 0.06268145975967249, 0.02200352089324345, 0.06462769278635581, 0.02750047544638316, 0.0023026781777540843, 0.25796371440325555, -0.06836107938239971, 0.24448468585809072))"
Brave Wilderness,1008240.0,Omg it’s so adorable it sounds like my cat,0.7661,1,"List(omg, it, s, so, adorable, it, sounds, like, my, cat)","List(omg, adorable, sounds, like, cat)","List(1, 20, List(), List(0.23596538454294205, -0.1652421295642853, -0.2857926726341248, 0.2929233313770965, 0.15470136544900015, 0.11412937790155411, 0.2674281768500805, -0.1548633225262165, 0.11761091947555542, -0.12404463849961758, 0.21599177354946733, 0.17296529263257981, 0.11921237334609032, 0.13287490345537664, 0.4395999148488045, -0.12394305430352688, -0.09271230548620224, 0.12991311959922314, 0.10959469443187118, 0.47557075321674347))"
Brian Barczyk,124380.0,Arty will get better! I will pray for you guys ! I was so freaked out when my dog started to throw up 🤢 but she is moving and shaking like new now ! Best of luck ! -Kaitlyn 😘❤️🐶❤️😘 ps tell arty that we all love him and Suzee,0.9707,1,"List(arty, will, get, better, i, will, pray, for, you, guys, i, was, so, freaked, out, when, my, dog, started, to, throw, up, but, she, is, moving, and, shaking, like, new, now, best, of, luck, kaitlyn, ps, tell, arty, that, we, all, love, him, and, suzee)","List(arty, get, better, pray, guys, freaked, dog, started, throw, moving, shaking, like, new, best, luck, kaitlyn, ps, tell, arty, love, suzee)","List(1, 20, List(), List(0.11461314811770404, -0.08029413019262609, -0.19567887077019328, -0.17967733667076874, -0.06478360350454403, -0.1057182467054753, -0.07276723782221475, 0.09881568150151343, 0.05278165638446808, -0.1525043768896943, -0.15799475314893893, -0.09533993851038672, 0.058209481959541634, -0.06700838285143532, 0.022027385598492055, 0.15797647310509566, -0.05816501857978957, 0.12939818008314996, 0.0884762971351544, 0.19146110507703962))"
Brian Barczyk,410371.0,that happend to my dog too...😢😢😢😢😢😢😢😢😢😢😢😢😢😢,0.0,1,"List(that, happend, to, my, dog, too)","List(happend, dog)","List(1, 20, List(), List(0.1629960648715496, -0.1909930817782879, -0.1471198219805956, 0.022979609668254852, -0.17798766866326332, 0.1892341710627079, 0.3672676645219326, -0.00673467293381691, -0.27548253536224365, -0.44878503680229187, -0.0676597859710455, -0.0843239352107048, 0.3919028639793396, 0.03761276602745056, -0.015659631812013686, 0.15045686438679695, 0.034597308840602636, 0.3092934153974056, -0.12808331847190857, 0.35601937770843506))"
Brian Barczyk,1766603.0,Im really crying because I have a dog thats a husky I dont want nothing to happen to her like youre dog I had a other dog who was bearly a baby and he died,-0.6922,1,"List(im, really, crying, because, i, have, a, dog, thats, a, husky, i, dont, want, nothing, to, happen, to, her, like, youre, dog, i, had, a, other, dog, who, was, bearly, a, baby, and, he, died)","List(im, really, crying, dog, thats, husky, dont, want, nothing, happen, like, youre, dog, dog, bearly, baby, died)","List(1, 20, List(), List(0.04498636667781016, -0.09766628154936959, -0.15784926486530287, -0.07159174405558802, -0.15149763334364466, 0.062069972989026126, 0.15878871099694686, 0.08035841162371285, -0.11697404130416758, -0.3040613972527139, 0.08365890580941648, -0.023781106752507827, 0.1125971130643259, 0.05339007026871995, 0.14301534462720156, 0.13431394823333795, -0.01886692555511699, 0.24448110568611062, -0.11056652511743938, 0.23487530033761525))"
Brian Barczyk,1996262.0,Number 5 because my cats are turning 5 next year,0.0772,1,"List(number, 5, because, my, cats, are, turning, 5, next, year)","List(number, 5, cats, turning, 5, next, year)","List(1, 20, List(), List(-0.06413719051384499, 0.35654981434345245, -0.0524111661527838, 0.1048979844365801, 0.2448391392827034, 0.08592584675976207, -0.04354365289743457, -0.23213474292840275, 0.012233639402048928, -0.27952118856566294, 0.31468853141580305, -0.5487539704356874, 0.08382923475333622, -0.03316056888018335, -0.05489969333367688, 0.13785512173282247, -0.12707588874867984, 0.212211512029171, 0.022241420511688505, 0.0453341631218791))"
Brian Barczyk,2351519.0,I feel you my dog had to be put down,0.0,1,"List(i, feel, you, my, dog, had, to, be, put, down)","List(feel, dog, put)","List(1, 20, List(), List(0.10272719711065292, 0.03389360383152962, -0.15924275542298952, -0.1692348668972651, -0.33177008231480914, 0.10639930376783013, 0.22677223881085712, 0.09729655832052231, -0.02961894373099009, -0.402711679538091, -0.049312603349486984, -0.020243217547734577, 0.19525713970263797, 0.1375209130346775, 0.1779693216085434, 0.10734600682432452, 0.1239811380704244, 0.14325454458594322, -0.00388297438621521, 0.10405414303143819))"
Bully Max,954499.0,Thats why I adopted a 2 year old dog from the shelter. he came already trained and I dont have to worry. Love my dog.,0.7654,1,"List(thats, why, i, adopted, a, 2, year, old, dog, from, the, shelter, he, came, already, trained, and, i, dont, have, to, worry, love, my, dog)","List(thats, adopted, 2, year, old, dog, shelter, came, already, trained, dont, worry, love, dog)","List(1, 20, List(), List(0.029368201004607335, -0.005417938211134502, -0.07576887354454291, -0.005407058500817844, 0.008314060579453195, 0.08390064404479095, 0.18061404834900582, -0.027521363592573574, -0.1854654892480799, -0.4246639534831047, 0.08279431824173246, -0.24366213807037898, 0.0598099700621008, 0.07538100491677012, -0.030051365361682004, 0.29548424761742353, -0.04198839781539781, 0.31759101205638474, 0.07491100477221023, 0.10984605510020629))"


In [27]:
display(trainingData.groupBy("label").count())
# view training data's number of each label

label,count
1,27505
0,27506


In [28]:
display(testData.groupBy("label").count())
# view test data's number of each label

label,count
1,11760
0,12637
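
The split weights can be sanity-checked with simple arithmetic. The sketch below multiplies the label counts from the earlier `iden_owner` query by the weights passed to `randomSplit`; because `randomSplit` samples rows randomly, the observed counts above are only expected to be close to these values, not equal to them.

```python
n_owner, n_other = 39265, 4234953   # label 1 / label 0 counts from the label query

expected_train_owner = n_owner * 0.7
expected_train_other = n_other * 0.0065
expected_test_owner = n_owner * 0.3
expected_test_other = n_other * 0.9935 * 0.003

print(expected_train_owner, expected_train_other)
print(expected_test_owner, expected_test_other)

# Weight for label 0 that would match the expected label-1 training size exactly
balancing_weight = expected_train_owner / n_other
print(balancing_weight)
```

The balancing weight comes out close to the 0.0065 used in the notebook, which explains why the training set is almost perfectly balanced.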


## Step 2: Build and tune classifiers for the cat and dog owners and measure the performance of the classifiers.

#### 2.1 Logistic Regression

In [31]:
from pyspark.ml.classification import GBTClassifier, GBTClassificationModel, LogisticRegression, LogisticRegressionModel, RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel

In [32]:
lr = LogisticRegression(maxIter=8, regParam=0.05, elasticNetParam=0.8)
lrModel = lr.fit(trainingData)

print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

In [33]:
evaluator = BinaryClassificationEvaluator()

print('AUC score for Logistic Regression without tuning the parameter: {}'.format(evaluator.evaluate(lrModel.transform(testData))))
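
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; it is meant to build intuition and is not how Spark computes the metric internally.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: one positive (0.4) is ranked below one negative (0.7).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(toy_auc)
```

Eight of the nine positive/negative pairs are ordered correctly here, so the result is 8/9.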

##### Parameter Tuning and K-fold cross-validation

In [35]:
paramGrid = ParamGridBuilder()\
            .addGrid(lr.regParam,[0.01,0.05,0.1,0.2])\
            .addGrid(lr.maxIter,[8,10,12,14])\
            .build()
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5,
                          seed = 42)

cvModel = crossval.fit(trainingData)


In [36]:
best_model = cvModel.bestModel
best_model.extractParamMap()

#### 2.2 Random Forest

In [38]:
trainingData.show(10)

In [39]:
rf = RandomForestClassifier(labelCol='label', featuresCol='features', seed=42)

paramGrid = ParamGridBuilder()\
            .addGrid(rf.numTrees,[64,100])\
            .addGrid(rf.maxDepth,[8,10])\
            .build()
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5,
                          seed = 42)
cv_rfModel = crossval.fit(trainingData)


In [40]:
best_rf_model = cv_rfModel.bestModel
best_rf_model.extractParamMap()
save_rf_path = 'Filestore/model/best_rf_model'
best_rf_model.save(save_rf_path)

#### 2.3 Gradient boosting

In [42]:
gbt = GBTClassifier(labelCol='label', featuresCol='features', seed=42)

paramGrid = ParamGridBuilder()\
            .addGrid(gbt.stepSize,[0.05,0.1])\
            .addGrid(gbt.maxIter,[8,10])\
            .addGrid(gbt.maxDepth,[5,6])\
            .build()
crossval = CrossValidator(estimator=gbt,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5,
                          seed = 42)
cv_gbtModel = crossval.fit(trainingData)
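
Grid search with cross-validation gets expensive quickly, so it helps to count the model fits before launching a run. The sketch below does this for the three grids used in this step; `count_fits` is a helper introduced here for illustration, and the "+ 1" reflects the final refit of the best parameter combination on the full training data.

```python
from itertools import product

def count_fits(grid, num_folds):
    """Return (number of parameter combinations, number of model fits)."""
    combos = list(product(*grid.values()))
    # one fit per combination per fold, plus one final refit with the best combination
    return len(combos), len(combos) * num_folds + 1

lr_grid = {"regParam": [0.01, 0.05, 0.1, 0.2], "maxIter": [8, 10, 12, 14]}
rf_grid = {"numTrees": [64, 100], "maxDepth": [8, 10]}
gbt_grid = {"stepSize": [0.05, 0.1], "maxIter": [8, 10], "maxDepth": [5, 6]}

for name, grid in (("lr", lr_grid), ("rf", rf_grid), ("gbt", gbt_grid)):
    print(name, count_fits(grid, num_folds=5))
```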

#### 2.4 Get the best model with the best hyperparameters

##### Load the trained models

In [45]:
lr_path = 'Filestore/model/best_lr_model'
rf_path = 'Filestore/model/best_rf_model'
gbt_path = 'Filestore/model/best_gbt_model'
lr_model = LogisticRegressionModel.load(lr_path)
rf_model = RandomForestClassificationModel.load(rf_path)
gbt_model = GBTClassificationModel.load(gbt_path)

##### Evaluate models with the test dataset

In [47]:
evaluator = BinaryClassificationEvaluator()

rf_predict = rf_model.transform(testData)
print('AUC score for fine-tuned Random Forest model: {}'.format(evaluator.evaluate(rf_predict)))

In [48]:
lr_predict = lr_model.transform(testData)
print('AUC score for fine-tuned Logistic Regression model: {}'.format(evaluator.evaluate(lr_predict)))

Note: We can see that the Logistic Regression model **improves its AUC score from 0.8516 to 0.8796** after tuning its parameters.

In [50]:
gbt_predict = gbt_model.transform(testData)
print('AUC score for fine-tuned Gradient-boosted Tree model: {}'.format(evaluator.evaluate(gbt_predict)))

In [51]:
display(rf_predict.groupby('prediction').count())

prediction,count
0.0,11626
1.0,12771


## Step 3: Apply the best classifier to all the users in the dataset. Estimate the fraction of all users who are cat/dog owners.

#### 3.1 Classify All The Users

In [54]:
prediction = rf_model.transform(dataset)

In [55]:
display(prediction.groupby('prediction').count())

prediction,count
0.0,3473228
1.0,800990


#### 3.2 Get fraction of pet owner users

In [57]:
prediction.createOrReplaceTempView('predict')

In [58]:
%sql

Select prediction, num/total as proportion
From
(Select prediction, count(*) as num
From predict
Group by 1) p
Cross Join (Select count(*) as total From predict) t


prediction,proportion
0.0,0.8125996381092401
1.0,0.1874003618907599
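
The proportions above follow directly from the prediction counts, and they are computed per comment. The sketch below reproduces the arithmetic and then shows, on made-up rows, how a per-user fraction could be derived instead; the rule "a user is an owner if any of their comments is predicted 1" is an assumption of this sketch, not something the notebook does.

```python
counts = {0.0: 3473228, 1.0: 800990}   # prediction counts shown above
total = sum(counts.values())
proportions = {k: v / total for k, v in counts.items()}
print(proportions)

# Toy (userid, prediction) rows, made up for illustration.
toy_rows = [(87, 0.0), (87, 1.0), (530, 0.0), (670, 0.0), (670, 0.0), (1031, 1.0)]
owner_flag = {}
for userid, pred in toy_rows:
    # a user is flagged as an owner if any comment is predicted 1
    owner_flag[userid] = owner_flag.get(userid, False) or pred == 1.0
user_fraction = sum(owner_flag.values()) / len(owner_flag)
comment_fraction = sum(1 for _, p in toy_rows if p == 1.0) / len(toy_rows)
print(user_fraction, comment_fraction)
```

In the toy rows the per-user fraction differs from the per-comment fraction, so the two should not be used interchangeably.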


## Step 4: Extract insights about cat and dog owners through finding topics important to cat and dog owners.

#### 4.1 Use CountVectorizer to extract features instead of word2Vec in order to visualize words in each topic

In [61]:
pet_user = prediction.filter(col('prediction')==1)

In [62]:
pet_user.count()

##### Use nltk stemmer to get the stem of words

In [64]:
from nltk.stem.porter import PorterStemmer

# Instantiate stemmer object
stemmer = PorterStemmer()

# Create stemmer python function
def stem(in_vec):
    out_vec = []
    for t in in_vec:
        t_stem = stemmer.stem(t)
        if len(t_stem) > 2:
            out_vec.append(t_stem)       
    return out_vec

# Create a user-defined function for stemming with return type Array<String>
from pyspark.sql.types import ArrayType, StringType
stemmer_udf = udf(lambda x: stem(x), ArrayType(StringType()))

# Create new column with vectors containing the stemmed tokens 
pet_user = pet_user.withColumn("stem_word", stemmer_udf("filtered"))

In [65]:
text1 = pet_user.select("stem_word").rdd.flatMap(lambda a: a.stem_word).countByValue()
wordfreq = sorted(text1.items(),key=lambda x:x[1],reverse=True)

In [66]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = pet_user.select("stem_word").collect()
stopwords = ['cat','dog','like','love','get','got','one','look','know','want','kitti','puppi','lol','dont','never','much','realli','see','also','peopl','need','thing','think','even']

# words = " ".join([(k + " ")*v for k,v in text1.items()])
words = " ".join([word for row in text for word in row[0]])

wcloud = WordCloud(stopwords=stopwords,max_words=1000, background_color="white").generate(words)

fig,ax0=plt.subplots(nrows=1,figsize=(12,8))
ax0.imshow(wcloud,interpolation='bilinear')

ax0.axis("off")
display(fig)

In [67]:
from pyspark.ml.feature import CountVectorizer

countvec = CountVectorizer(inputCol='stem_word', outputCol='vectors')
count_model = countvec.fit(pet_user)
train = count_model.transform(pet_user)

#### 4.2 Extract topics of pet owner users via LDA clustering

In [69]:
from pyspark.ml.clustering import LDA

lda = LDA(maxIter=30,k=20,featuresCol='vectors',seed=2)
lda_model = lda.fit(train)

In [70]:
from pyspark.ml.clustering import LocalLDAModel

path = 'dbfs:/Filestore/model/lda_model'
lda_model = LocalLDAModel.load(path)

In [71]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

vocab = count_model.vocabulary

def trans(x):
  return [vocab[i] for i in x]

idx2word = udf(lambda y: trans(y), ArrayType(StringType()))

In [72]:
topics = lda_model.describeTopics(10).withColumn("terms",idx2word("termIndices"))
display(topics.select("topic","terms","termWeights"))

topic,terms,termWeights
0,"List(otherwis, manag, answer, success, heavi, prepar, highli, nala, disgust, heartbeat)","List(0.002753511239482618, 0.002361180601600332, 0.0023605552017384123, 0.0023599399793519356, 0.0019673764775817084, 0.0018396274859288472, 0.0015692128883641088, 0.0015396320784823022, 0.0013742680942309753, 0.001217344435571216)"
1,"List(loki, bitch, 100, florida, trade, otter, patient, respond, whisker, wobbl)","List(0.010569410356347215, 0.002156719946686762, 0.0019270164657736765, 0.0016627641023094688, 0.0013630404773171357, 0.0011038174574266577, 0.0010795811714814727, 0.0010525294700361012, 9.652333173937885E-4, 9.036942233495148E-4)"
2,"List(move, look, whoever, pitch, mango, shampoo, defect, jim, trend, italian)","List(0.020830716513999745, 0.02025594184659543, 0.0020201292260128034, 0.0011482460330011206, 9.83737719276245E-4, 8.45977331809995E-4, 8.333937944722888E-4, 8.030341715530662E-4, 7.387980444602307E-4, 7.311096724017083E-4)"
3,"List(gener, competit, translat, reinforc, tragic, aspca, overload, fever, esp, lovabl)","List(0.001393537145917369, 0.00112092834400439, 9.53036743811484E-4, 9.376355292220524E-4, 8.695307332582123E-4, 6.66637900947543E-4, 6.315636820609265E-4, 5.317197277654631E-4, 4.820940148643618E-4, 4.392757081611066E-4)"
4,"List(okay, shelter, youtub, breeder, chang, bro, mess, flip, tortois, who)","List(0.009792664545535958, 0.009262383257651326, 0.009181905767381926, 0.004030378055061363, 0.0032189419116253858, 0.002521323368543858, 0.002083706512657255, 0.002023072939436958, 0.0018736393857267516, 0.001852440589014055)"
5,"List(kick, buy, hamster, mention, fall, dog, memphi, click, stripe, maya)","List(0.00794663890182564, 0.0036863176841135494, 0.0034395560217161357, 0.003333626342089096, 0.002351169919305298, 0.002349822534940129, 0.0020978445026755355, 0.001971049295696903, 0.0016369577494271307, 0.00142389603380631)"
6,"List(milk, 4th, count, helmet, nicest, habitat, trim, uncl, irish, smell)","List(0.0032521754655874466, 0.0017736854839057114, 0.0013509824772520804, 0.0013379611305481566, 0.0010237932441303395, 9.156661845748229E-4, 8.459411637663644E-4, 7.544521454520792E-4, 7.327469516524592E-4, 6.724463210777417E-4)"
7,"List(obvious, huge, volunt, although, major, horrifi, asriel, japanes, desper, account)","List(0.0058933776862888365, 0.0024398911857580374, 0.0020811106180241914, 0.0019214063891186985, 0.001089151428205419, 0.0010793469362402237, 0.0010337980012431705, 0.0010282181771935621, 9.647898375048683E-4, 8.333131063778994E-4)"
8,"List(paw, climb, american, chewi, smack, dont, organ, smother, ima, decor)","List(0.007383873176066315, 0.003039706965814243, 0.0023266990748339123, 0.0014448208587298852, 0.0014071988029655837, 0.0012425939791765254, 0.0011919042055856053, 9.830965506288146E-4, 9.481686163019074E-4, 9.342897935757682E-4)"
9,"List(hot, soooo, favourit, haven, resourc, unlik, tht, cycl, suck, graduat)","List(0.004911224181075029, 0.003722172325273909, 0.0013886514418487747, 0.001096660140977006, 0.0010043018066334972, 8.679800321942473E-4, 8.413415849317605E-4, 8.294631776468398E-4, 8.287551414433097E-4, 7.407989793221809E-4)"


In [73]:
from pyspark.sql.functions import explode, arrays_zip

topics_words = topics.withColumn("termWithProb", explode(arrays_zip("terms", "termWeights")))
topics_words.createOrReplaceTempView("topics")


In [74]:
%sql

Select topic, termWithProb["terms"] as term, termWithProb["termWeights"] as probability
From topics

-- get the words in each topic with their probability

topic,term,probability
0,otherwis,0.0027535112394826
0,manag,0.0023611806016003
0,answer,0.0023605552017384
0,success,0.0023599399793519
0,heavi,0.0019673764775817
0,prepar,0.0018396274859288
0,highli,0.0015692128883641
0,nala,0.0015396320784823
0,disgust,0.0013742680942309
0,heartbeat,0.0012173444355712


In [75]:
%sql

Select termWithProb["terms"] as term, max(termWithProb["termWeights"]) as probability
From topics
Group by 1
Order by 2 Desc

term,probability
year,0.0520753290900503
happi,0.0511508265705477
dog,0.0273796125742855
like,0.0260163281773102
cat,0.0212699527063579
move,0.0208307165139997
look,0.0202559418465954
love,0.0171459501763329
get,0.0162833003678356
night,0.0137441574803722
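
The two queries above first flatten each topic into one row per term and then keep the highest weight seen for each term. The sketch below replays both steps in plain Python on the first three terms of topics 2, 5 and 8 (weights rounded from the table above), as a stand-in for `explode(arrays_zip(...))` followed by the `GROUP BY` query.

```python
# (topic, terms, termWeights) rows, rounded from the describeTopics output above.
toy_topics = [
    (2, ["move", "look", "whoever"], [0.0208, 0.0203, 0.0020]),
    (5, ["kick", "buy", "hamster"], [0.0079, 0.0037, 0.0034]),
    (8, ["paw", "climb", "american"], [0.0074, 0.0030, 0.0023]),
]

# explode(arrays_zip(terms, termWeights)): one row per (topic, term, weight)
exploded = [(topic, term, w)
            for topic, terms, weights in toy_topics
            for term, w in zip(terms, weights)]

# max weight per term, sorted in descending order
best = {}
for _, term, w in exploded:
    best[term] = max(w, best.get(term, 0.0))
ranking = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
print(ranking[:3])
```

The `max` only changes the result when a term appears in more than one topic, which is the case for frequent words such as "dog" in the full 20-topic model.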


## Step 5: Identify creators with pet owners in the audience. Find creators whose videos are commented on by the most cat and/or dog owners, in other words, creators with the highest statistically significant percentages of cat and/or dog owners.

In [77]:
import matplotlib.pyplot as plt

Top 10 creators with the largest audience of pet owner users

In [79]:

top = spark.sql(
      """Select creator_name, count(*) as pet_owner_num, sum(case when sentiment_score>=0.05 then 1 else 0 end)/count(*) as positive_comment_rate
      From predict
      Where prediction=1
      Group by 1
      Order by 2 Desc, 3 Desc
      limit 10""")

display(top)


creator_name,pet_owner_num,positive_comment_rate
The Dodo,77973,0.5736601131160787
Robin Seplut,45130,0.6307334367383115
Brave Wilderness,41736,0.4754169062679701
Taylor Nicole Dean,39560,0.5164812942366026
Hope For Paws - Official Rescue Channel,34504,0.591670530952933
Gohan The Husky,30567,0.5974744004972683
Vet Ranch,23953,0.567027094727174
Gone to the Snow Dogs,23089,0.5696218978734462
Brian Barczyk,20673,0.6123446040729454
Viktor Larkhill,18529,0.5996006260456581


In [80]:
fig, (ax1, ax2) = plt.subplots(figsize=(10,6), ncols=2, constrained_layout=True)
top_pd = top.toPandas()

# plotting a string column makes the x axis categorical,
# so the creator names are used as tick labels automatically
ax1.plot('creator_name','pet_owner_num',data=top_pd)
ax1.set_xlabel("creator name")
ax1.tick_params(axis='x', labelrotation=90)
ax1.set_ylabel("number of comments from pet owner users")
ax1.set_title("video creators that received the most comments from pet owner users", y=1.08)
ax1.legend()

# use the same pandas DataFrame for the second panel
ax2.plot('creator_name','positive_comment_rate',data=top_pd)
ax2.set_xlabel("creator name")
ax2.tick_params(axis='x', labelrotation=90)
ax2.set_ylabel("positive comment rate")
ax2.set_title("positive comment rate of each creator", y=1.08)
ax2.legend()

display(fig)

Find video creators that received at least 1,000 comments from pet owner users, ranked by **top positive comment rate**

In [82]:
%sql 

Select creator_name, count(*) as comment_num, sum(case when sentiment_score>=0.05 then 1 else 0 end)/count(*) as positive_comment_rate
From predict
Where prediction=1
Group by creator_name
Having count(*)>=1000
Order by 3 Desc, 2 Desc
limit 20

creator_name,comment_num,positive_comment_rate
Cute Cats Kwazi and Uli,1156,0.8685121107266436
Thor Unleashed,1040,0.8384615384615385
Schnauzer Mom,1255,0.7872509960159363
FROSTY Life,3371,0.7502224859092258
MaxluvsMya,6642,0.7488708220415538
Einstein Parrot,1363,0.7329420396184886
Teresa Bullock,1262,0.7297939778129953
Cockatoo Luck,2142,0.7259570494864612
HammyLux,6491,0.7196117701432753
BrookIvy3,3782,0.7176097303014278
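
The `Having count(*)>=1000` filter is a simple guard against rates computed from too few comments. One possible alternative, shown here only as a sketch and not used in the notebook, is to rank creators by the lower end of the Wilson score interval, which penalises small samples automatically. The function name and the 95% level (`z=1.96`) are choices made for this illustration.

```python
from math import sqrt

def wilson_lower_bound(positives, total, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if total == 0:
        return 0.0
    p = positives / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)

# 1004 positive comments out of 1156 reproduces the 0.8685 rate of the first row.
large_sample = wilson_lower_bound(1004, 1156)
# A similar rate from only 10 comments (hypothetical creator) gets a much lower bound.
small_sample = wilson_lower_bound(9, 10)
print(large_sample, small_sample)
```

With this kind of score a creator with few comments would need a much higher raw rate to outrank a creator with thousands of comments.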


Find video creators that received at least 10,000 comments, ranked by **top rate of (number of comments from pet owner users)/(number of comments from all users)**

In [84]:
%sql

with filter as 
  (Select creator_name, count(*) as num
  From predict
  Group by 1
  Having num>=10000
  ),
  part as 
  (Select pre.creator_name, prediction, count(*) as pet_owner_num
  From predict as pre
  Inner join filter as f on pre.creator_name=f.creator_name
  Group by 1,2
  Order by 1)

Select p.creator_name, pet_owner_num, pet_owner_num/num as fraction
From part as p
Inner join filter as f on p.creator_name=f.creator_name
Where p.prediction = 1
Order by 3 DESC



creator_name,pet_owner_num,fraction
Zak Georges Dog Training rEvolution,11302,0.6745449119665772
Gone to the Snow Dogs,23089,0.5263409852508719
MaxluvsMya,6642,0.5211046602855798
meow meow,6232,0.5137251669277059
Dogumentary TV,5149,0.4992727625327257
Robin Seplut,45130,0.496594372737376
stacyvlogs,15211,0.4791318864774624
Paws Channel,10343,0.4751251780054205
Kitten Academy,7459,0.4622869538270839
ViralBe,8364,0.4468187403173246


#### 5. Analysis and Future work

## Summary

In this project, I worked with 5.8 million comments on YouTube videos about pets to perform **semantic analysis** and **recommendation to pet owner users**.
1. **Filtered out non-English comments** and applied VADER **sentiment analysis** to classify the comments as Positive, Neutral or Negative.
2. **Labelled comments from pet owner users** (label 1) using keywords such as "my dog", "my cat" and "I have a cat", which yielded around 40,000 label 1 comments. Added around 40,000 label 0 comments and split the combined data into training and test sets.
3. Built a pipeline consisting of a tokenizer, a stopword remover and Word2Vec to **transform comments into numeric feature vectors**.
4. Trained classification models including Logistic Regression, Random Forest and Gradient Boosting and tuned their parameters via cross-validation. Compared these classification models and obtained the best-performing model, Random Forest, with 93% accuracy.
5. Applied the Random Forest model to the whole dataset to **classify comments from pet owner users, which made up around 18.7% of all comments**.
6. **Extracted the main topics of comments from pet owner users**. The highest-weighted words included "cat", "dog", "like" and "cute", which did not carry much useful information. I also tried adding those words to the stopwords, but still could not find interesting content words. This part needs further exploration.
7. Among video creators with at least 10,000 comments, "Zak Georges Dog Training rEvolution", "Gone to the Snow Dogs" and "MaxluvsMya" are the top 3 creators with the **highest fraction of comments from pet owner users**. Also found the video creators that received at least 1,000 comments from pet owner users with the **top positive comment rate**. For instance, the top 3 creators, 'Cute Cats Kwazi and Uli', 'Thor Unleashed' and 'Schnauzer Mom', received respectively 86.9%, 83.8% and 78.7% positive comments from pet owner users. These creators can be **recommended to users interested in pets**.

## Future work
1. Identify pet owner users based on more information.
2. Explore other methods to extract significant topics for pet owner users to obtain useful information.