# Scraping and Analyzing YouTube Comments

This script gives a detailed walkthrough of how to use the tuber package (https://cran.r-project.org/web/packages/tuber/tuber.pdf) to extract YouTube comments and format them, and provides some basic examples for analyzing the text and the emojis they contain.

The script is a preliminary part of an ongoing research project (https://www.researchgate.net/project/Methods-and-Tools-for-Automatic-Sampling-and-Analysis-of-YouTube-Comments) and will be subject to change. If you use substantive parts of this script as part of your own research, please cite it.

## Setup

First of all, we need to set up our R environment correctly so we are able to use the tuber package and all other
necessary packages.

In [None]:
# removing all objects from the current global environment
rm(list=ls())

In [None]:
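# choose a working directory in a GUI window
# (choose.dir() is only available on Windows; on other systems, assign the path to dir as a string)
dir <- choose.dir()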

In [None]:
# changing working directory
setwd(dir)

In [None]:
# remove directory string from global environment
rm(dir)

In [None]:
# set options parameter so that text strings are not interpreted as factor variables
options(stringsAsFactors = FALSE)

In [None]:
# create list of necessary packages for data extraction and analysis
# (syuzhet and lexicon are needed for the sentiment analyses further below)
packages <- c("devtools",
              "tm",
              "quanteda",
              "tuber",
              "qdapRegex",
              "rlang",
              "purrr",
              "ggplot2",
              "syuzhet",
              "lexicon")

In [None]:
# installing all packages in list
install.packages(packages,repos='http://cran.us.r-project.org')

In [None]:
# attaching all packages in list
lapply(packages, library, character.only = TRUE)

In [None]:
# removing the list-variable with the package names from the global environment
rm(packages)

In [None]:
# Installing the emo package from github (not on CRAN yet)
devtools::install_github("hadley/emo")
library(emo)

In [None]:
# checking if packages have been properly attached
sessionInfo()

# Authentication for YouTube API

To get access to data from YouTube, we have to use the YouTube API (https://developers.google.com/youtube/v3/).
To do this, we need a token that identifies us when using the API so that YouTube can make sure
that its Terms and Conditions are respected (e.g. there is a limit on how much data you can access in
a given timeframe). We'll go about this step by step:

1. If you do not have a Google account that you are willing to use for this project, you have to create a new one here: https://accounts.google.com/signup/v2/webcreateaccount?hl=en-GB&flowName=GlifWebSignIn&flowEntry=SignUp

2. With your Google account, you have to create a "Google Project". Detailed instructions can be found here: https://www.youtube.com/watch?v=Im69kzhpR3I

3. Use the credentials of the project account to authenticate your R session. By doing this, you allow R to access all data on YouTube that you would be able to see if you went to the site logged in with this account. Be careful **NOT** to share these credentials with anyone you do not want to be able to log into this account. To get the credentials, do the following on the project website:

- activate the YouTube Data API v3
- click on "create credentials"
- select authentication through the web browser
- select access to private data with the user's permission
- set the "authorised referral URL" to "http://localhost:1410/"

4. Run the authentication for the R session using the credentials from the Google project:

In [None]:
appID <- "" # Insert your own app Id here
appSecret <- "" # insert your own app Secret here

Upon running the following line, there will be a prompt in the console asking whether you want to save the access token in a file.
Select "No" by entering 2 in the console and hitting enter.
Afterwards, a browser window should open, prompting you to log in with your Google account.
After logging in, you can close the browser and come back to R.

In [None]:
yt_oauth(appID,appSecret)

**IMPORTANT**: Sometimes, the authentication stops working (HTTP Failure: 401). If that happens, your authentication token has expired and you need to sign into your Google account once more. To do that, go to the working directory, delete the file ".httr-oauth" and rerun the authentication command in R. A browser window with your Google account login screen should open. Log in and close the browser afterwards. To prevent this from happening in the future, do **NOT** save your token in a file (enter 2: "No" in the command line dialogue for the authentication).
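
The recovery steps described above can be wrapped in a small helper. This is only a sketch: `reset_yt_token()` is a hypothetical name, and it assumes that the cached token lives in a file called ".httr-oauth" in the current working directory, as described above.

```r
# Hypothetical helper: delete a cached OAuth token file if it exists.
# Returns TRUE if a file was removed and FALSE if there was nothing to delete.
reset_yt_token <- function(token_file = ".httr-oauth") {
  if (file.exists(token_file)) {
    file.remove(token_file)
  } else {
    FALSE
  }
}

# Usage (commented out because the second line opens a browser window):
# reset_yt_token()
# yt_oauth(appID, appSecret)
```

Keeping the file name as an argument makes it possible to point the helper at a different location if the token was cached elsewhere.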

# Extracting Comments

First, we need the video ID of the YouTube video(s) in question.
We can find it by navigating to the video in our web browser and
copying the last part of the URL that comes after "?v=".

As an example, let's take the song "Baby" by Justin Bieber.
You can find it on YouTube here: https://www.youtube.com/watch?v=kffacxfA7G4

So the necessary video ID would be "kffacxfA7G4".
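
If you have many URLs, copying the ID by hand gets tedious. The following is a minimal sketch of how the ID could be pulled out with a regular expression in base R. `extract_video_id()` is a hypothetical helper, and it only handles URLs of the form "watch?v=..." (short links and other URL formats are not covered; in that case the input is returned unchanged).

```r
# Hypothetical helper: extract the value of the "v" query parameter
# from a YouTube watch URL. If the pattern does not match,
# sub() returns the input string unchanged.
extract_video_id <- function(url) {
  sub(".*[?&]v=([^&#]+).*", "\\1", url)
}

extract_video_id("https://www.youtube.com/watch?v=kffacxfA7G4")
```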

In [None]:
# saving the video ID in a variable
VidID <- "kffacxfA7G4"

In [None]:
# set max_results to a value between 20 and 100 to scrape an excerpt of the comments
Comments <- get_comment_threads(c(video_id=VidID), max_results = 100)

In [None]:
# BEFORE you run this command, do a test run and see how quickly you can fetch a few hundred comments,
# extrapolate from that whether you can safely fetch all comments at once or if you have to split
# it up into smaller chunks.

# Not recommended for the Justin Bieber Video (4.5 Million comments)

# to extract all comments
Comments <- get_all_comments("ENTER YOUR VIDEO ID HERE")

**NOTE** Your dataset might contain fewer comments than are displayed as total comments on YouTube. This is because the tuber package only scrapes up to five _replies_ to each comment. If a comment has more than five replies, all subsequent replies will not be extracted by the tuber package.
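
The test run recommended before fetching all comments can be turned into a rough time estimate. The sketch below is an assumption-laden back-of-the-envelope calculation: `estimate_hours()` is a hypothetical helper, the numbers are made up, and it assumes that the time per comment stays constant, which may not hold because of API limits or network conditions.

```r
# Hypothetical helper: linear extrapolation from a timed test run.
estimate_hours <- function(n_fetched, seconds_elapsed, n_total) {
  seconds_per_comment <- seconds_elapsed / n_fetched
  seconds_per_comment * n_total / 3600
}

# Made-up example: 300 comments took 12 seconds in the test run,
# and the video has 4.5 million comments in total.
estimate_hours(n_fetched = 300, seconds_elapsed = 12, n_total = 4.5e6)

# The elapsed time of a real test run could be measured with system.time(), e.g.:
# timing <- system.time(Test <- get_comment_threads(c(video_id = VidID), max_results = 100))
# timing[["elapsed"]]
```

With these made-up numbers the estimate comes out at about 50 hours, which illustrates why fetching all comments of a very popular video at once is not recommended.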

# Formatting the Data

To use the data effectively, we will have to format it. Formatting will include:

- Extracting Emojis from comments and formatting them in human-readable and R-friendly formats
- Extracting URLs from comments
- Handling special characters in comments
- Converting timestamps into a format usable by R

To this end, we wrote a custom function for formatting the data. This function is still work in progress and at this point far from computationally efficient, but it might nonetheless help you to bring the comments into a format you can work with.
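
Before looking at the full function, two of these steps can be illustrated in isolation. The following is a minimal base R sketch with made-up inputs: the timestamp string and the example comment are assumptions for illustration, and the URL pattern is a deliberately simple one (the function below uses `qdapRegex::rm_url()` instead).

```r
# Example timestamp in an ISO 8601 style (made-up value)
raw_time <- "2019-03-01T12:34:56.000Z"

# Keep the first 19 characters ("YYYY-MM-DDTHH:MM:SS") and parse them as UTC
parsed_time <- as.POSIXct(substr(raw_time, 1, 19),
                          format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Simple URL extraction with a base R regular expression:
# everything from "http://" or "https://" up to the next whitespace
comment <- "check this out https://example.com/page and http://example.org"
urls <- regmatches(comment, gregexpr("https?://[^[:space:]]+", comment))[[1]]

parsed_time
urls
```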

In [None]:
yt_parse <- function(x){
        
        #### We need to check first whether the data has 15 columns (Videos for which user has mod rights) or less (regular public videos)
        
        if (dim(x)[2] > 12) {
                
                # only keeping relevant columns
                x <- x[,c(1,7,10,11,12,13,14)]
                
                # Converting dataframe columns to proper classes
                x[,1] <- as.factor(x[,1])
                x[,2] <- as.character(x[,2])
                x[,3] <- as.numeric(x[,3])
                x[,6] <- as.character(x[,6])
                x[,7] <- as.character(x[,7])
                
                # converting timestamps into proper date-time objects
                Published <- unlist(lapply(as.character(x[,4]),function(x){paste(substr(x,1,10),substr(x,12,19),sep = "-")}))
                x[,4] <- as.POSIXct(Published, format ="%Y-%m-%d-%H:%M:%S ", tz = "UTC")
                
                Updated <- unlist(lapply(as.character(x[,5]),function(x){paste(substr(x,1,10),substr(x,12,19),sep = "-")}))
                x[,5] <- as.POSIXct(Updated, format ="%Y-%m-%d-%H:%M:%S ", tz = "UTC")
                
                
                #### Emoji
                
                ## we need a function to transfer Emoji Names to CamelCase (taken from: )
                simpleCap <- function(x) {
                        s <- strsplit(x, " ")[[1]]
                        paste(toupper(substring(s, 1,1)), substring(s, 2),
                              sep="", collapse=" ")
                }
                
                ## We need a function to detect and replace EMOJI in the fulltext comments
                
                ReplaceEM <- function(x) {
                        
                        
                        # Setup: importing emoticon List
                        EmoticonList <- jis
                        
                        ListedEmojis <- as.list(jis[,4])
                        CamelCaseEmojis <- lapply(jis$name,simpleCap)
                        CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ","",x,fixed=TRUE)})
                        EmoticonList[,4]$name <- unlist(CollapsedEmojis)
                        
                        # order the list by the length of the string to avoid partial matching of shorter strings
                        EmoticonList <- EmoticonList[rev(order(nchar(jis$emoji))),]
                        
                        # Setup: We need to assign x to a new variable so we can save the progress in the for loop
                        New <- x
                        
                        # rm_default throws a useless warning on each iteration that we can ignore
                        oldw <- getOption("warn")
                        options(warn = -1)
                        
                        # cycle through the list and replace everything
                        # we have to add clean = FALSE and trim = FALSE to not delete whitespaces that are part of the pattern.
                        
                        for (i in 1:dim(EmoticonList)[1]){
                                
                                New <- rm_default(New, pattern=EmoticonList[i,3],replacement= paste0("EMOJI_", EmoticonList[i,4]$name, " "), fixed = TRUE, clean = FALSE, trim = FALSE)
                                
                        }
                        
                        # turning warnings back on
                        options(warn = oldw)
                        
                        # output result
                        return(New)
                        
                }
                
                # Creating a Text column where Emojis are replaced by their textual descriptions
                
                TextEmoRep <- ReplaceEM(x[,2])
                
                # Creating a text column where Emojis are deleted

                TextEmoDel <- emo::ji_replace_all(x[,2],"")
                
                # Creating a Column listing only the textual descriptions of Emojis per message
                
                ExtractEM <- function(x){
                        
                        SpacerInsert <- gsub(" ","[{[SpAC0R]}]", x)
                        ExtractEmoji <- rm_between(SpacerInsert,"EMOJI_","[{[SpAC0R]}]",fixed=TRUE,extract = TRUE, clean= FALSE,trim=FALSE,include.markers = TRUE)
                        UnlistEmoji <- unlist(ExtractEmoji)
                        DeleteSpacer <- sapply(UnlistEmoji,function(x){gsub("[{[SpAC0R]}]"," ",x,fixed=T)})
                        names(DeleteSpacer) <- NULL
                        
                        Emoji <-paste0(DeleteSpacer,collapse="")
                        return(Emoji)
                        
                }
                
                # Extracting and renaming the Emojis
                Emoji <- sapply(TextEmoRep,ExtractEM)
                
                #### Links
                
                # Extracting links from comments
                
                Links <- qdapRegex::rm_url(x[,2], extract = TRUE)
                Links <- I(Links)
                
                #### Combining it into one dataframe

                a <- cbind.data.frame(x[,1],Emoji)
                
                df <- cbind.data.frame(x[,1],x[,2],TextEmoRep,TextEmoDel,Emoji,x[,3],Links,x[,4],x[,5],x[,6],x[,7])
                names(df) <- c("Author","Text","TextEmojiReplaced","TextEmojiDeleted","Emoji","LikeCount","URL","Published","Updated","CommentID","ParentID")
                row.names(df) <- NULL
                
                
        }
        
        else if(dim(x)[2] == 12){
                
                # only keeping relevant columns
                x <- x[,c(1,7,10,11,12)]
                
                # Converting dataframe columns to proper classes
                x[,1] <- as.factor(x[,1])
                x[,2] <- as.character(x[,2])
                x[,3] <- as.numeric(x[,3])
                
                # converting timestamps into proper date-time objects
                Published <- unlist(lapply(as.character(x[,4]),function(x){paste(substr(x,1,10),substr(x,12,19),sep = "-")}))
                x[,4] <- as.POSIXct(Published, format ="%Y-%m-%d-%H:%M:%S ", tz = "UTC")
                
                Updated <- unlist(lapply(as.character(x[,5]),function(x){paste(substr(x,1,10),substr(x,12,19),sep = "-")}))
                x[,5] <- as.POSIXct(Updated, format ="%Y-%m-%d-%H:%M:%S ", tz = "UTC")
                
                
                #### Emoji
                
                ## we need a function to transfer Emoji Names to CamelCase (taken from: )
                simpleCap <- function(x) {
                        s <- strsplit(x, " ")[[1]]
                        paste(toupper(substring(s, 1,1)), substring(s, 2),
                              sep="", collapse=" ")
                }
                
                ## We need a function to detect and replace EMOJI in the fulltext comments
                
                ReplaceEM <- function(x) {
                        
                        
                        # Setup: importing emoticon List
                        EmoticonList <- jis
                        
                        ListedEmojis <- as.list(jis[,4])
                        CamelCaseEmojis <- lapply(jis$name,simpleCap)
                        CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ","",x,fixed=TRUE)})
                        EmoticonList[,4]$name <- unlist(CollapsedEmojis)
                        
                        # order the list by the length of the string to avoid partial matching of shorter strings
                        EmoticonList <- EmoticonList[rev(order(nchar(jis$emoji))),]
                        
                        # Setup: We need to assign x to a new variable so we can save the progress in the for loop
                        New <- x
                        
                        # rm_default throws a useless warning on each iteration that we can ignore
                        oldw <- getOption("warn")
                        options(warn = -1)
                        
                        # cycle through the list and replace everything
                        # we have to add clean = FALSE and trim = FALSE to not delete whitespaces that are part of the pattern.
                        
                        for (i in 1:dim(EmoticonList)[1]){
                                
                                New <- rm_default(New, pattern=EmoticonList[i,3],replacement= paste0("EMOJI_", EmoticonList[i,4]$name, " "), fixed = TRUE, clean = FALSE, trim = FALSE)
                                
                        }
                        
                        
                        # turning warnings back on
                        options(warn = oldw)
                        
                        # output result
                        return(New)
                        
                }
                
                # Creating a Text column where Emojis are replaced by their textual descriptions
                
                TextEmoRep <- ReplaceEM(x[,2])
                
                # Creating a text column where Emojis are deleted
                
                TextEmoDel <- emo::ji_replace_all(x[,2],"")
                
                # Creating a Column listing only the textual descriptions of Emojis per message
                
                ExtractEM <- function(x){
                        
                        SpacerInsert <- gsub(" ","[{[SpAC0R]}]", x)
                        ExtractEmoji <- rm_between(SpacerInsert,"EMOJI_","[{[SpAC0R]}]",fixed=TRUE,extract = TRUE, clean= FALSE,trim=FALSE,include.markers = TRUE)
                        UnlistEmoji <- unlist(ExtractEmoji)
                        DeleteSpacer <- sapply(UnlistEmoji,function(x){gsub("[{[SpAC0R]}]"," ",x,fixed=T)})
                        names(DeleteSpacer) <- NULL
                        
                        Emoji <-paste0(DeleteSpacer,collapse="")
                        return(Emoji)
                        
                }
                
                # Extracting and renaming the Emojis
                Emoji <- sapply(TextEmoRep,ExtractEM)
                
                
                #### Links
                
                # Extracting links from comments
                
                Links <- qdapRegex::rm_url(x[,2], extract = TRUE)
                Links <- I(Links)
                
                #### Combining it into one dataframe
                
                df <- cbind.data.frame(x[,1],x[,2],TextEmoRep,TextEmoDel,Emoji,x[,3],Links,x[,4],x[,5])
                names(df) <- c("Author","Text","TextEmojiReplaced","TextEmojiDeleted","Emoji","LikeCount","URL","Published","Updated")
                row.names(df) <- NULL
                
                
        }
        
        
        #### Returning dataframe
        
        return(df)
        
}

We can simply use this function on the comments dataframe that we extracted to get a new dataframe that is
formatted nicely.

In [None]:
# using the function to format the "Comments" dataframe 
FormattedComments <- yt_parse(Comments)

In [None]:
# Displaying first 10 formatted comments
head(FormattedComments,10)
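
One detail inside `yt_parse` deserves a short illustration: the emoji patterns are sorted by string length before they are replaced, so that a longer pattern is not partially matched by a shorter one. The sketch below shows the effect with ASCII stand-ins instead of real emojis. The patterns, replacement labels and the helper `replace_all()` are all hypothetical and only serve to demonstrate the ordering issue.

```r
# Hypothetical patterns: the short one is a prefix of the long one
patterns <- c(":-)" = "EMOJI_Smile ", ":-))" = "EMOJI_BigSmile ")

# Replace every pattern in turn, in the order given
replace_all <- function(text, patterns) {
  for (p in names(patterns)) {
    text <- gsub(p, patterns[[p]], text, fixed = TRUE)
  }
  text
}

# Shortest pattern first: the long pattern is broken apart
shortest_first <- replace_all("nice :-))", patterns)

# Longest pattern first: the long pattern is matched as a whole
ordered <- patterns[order(nchar(names(patterns)), decreasing = TRUE)]
longest_first <- replace_all("nice :-))", ordered)

shortest_first
longest_first
```

In the first case a stray ")" is left behind and the wrong label is inserted, which is the kind of partial match the sorting step is meant to avoid.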

# Analysis of Text

**Disclaimer**: The textual analysis is adapted from: https://docs.quanteda.io/articles/pkgdown/examples/plotting.html

For the textual analysis, we will use the column of our formatted dataframe that does **not** contain any Emojis in any format.

In [None]:
# Displaying first 10 elements of the column we will use for the text analysis
head(FormattedComments$TextEmojiDeleted,10)

Next, we will tokenize the comments (i.e. split them into single words). At the same time, we will remove numbers, punctuation,
separators, symbols, hyphens and URLs.

In [None]:
## tokenizing the comments (splitting them into single words and signs)
## note: in quanteda versions >= 2.0, the argument remove_hyphens was renamed to split_hyphens
toks <- tokens(char_tolower(FormattedComments$TextEmojiDeleted),
               remove_numbers = TRUE,
               remove_punct = TRUE,
               remove_separators = TRUE,
               remove_symbols = TRUE,
               remove_hyphens = TRUE,
               remove_url = TRUE)

In [None]:
# displaying the first 10 tokenized comments
toks[1:10]

Next, we will create a document-feature matrix (DFM), also known as a document-term matrix (https://en.wikipedia.org/wiki/Document-term_matrix).
In our case, a document is a comment. Put simply, this matrix counts how often each word that occurs in any comment appears in every single comment. We do this while removing stopwords (https://en.wikipedia.org/wiki/Stop_words), which can negatively influence the analysis.
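
To make the idea concrete, here is a toy version of such a matrix built with base R only. This is a sketch for illustration, not how quanteda works internally: the three example comments and the tiny stopword list are made up.

```r
# Three made-up comments
toy_docs <- c("i love this song", "love love love", "this song is old")

# Split each comment into words and collect the vocabulary
toy_toks <- strsplit(toy_docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(toy_toks)))

# Count how often each vocabulary word occurs in each comment
counts <- vapply(toy_toks,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))

# Rows are comments, columns are words
dtm <- t(counts)
dimnames(dtm) <- list(paste0("comment", seq_along(toy_docs)), vocab)

# Remove a (made-up) list of stopwords
toy_stopwords <- c("i", "this", "is")
dtm <- dtm[, !(colnames(dtm) %in% toy_stopwords), drop = FALSE]
dtm
```

Column sums of this matrix give the total frequency of each word, while the number of non-zero entries per column gives the number of comments containing the word. Both quantities are used in the plots further down.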

In [None]:
# We build a document frequency matrix while removing Stopwords
# Stopwords are very frequent words that occur in all texts (e.g. "a","but","it")
commentsDfm <- dfm(toks, remove = quanteda::stopwords("english"))

In [None]:
# We can display the frequency of terms in the documents
TermFreq <- textstat_frequency(commentsDfm)
TermFreq

### Visualizing by total occurrences

In this step, we want to use the DFM we created to visualize the most frequent tokens across all comments. First, we order
the tokens, then we plot their frequency.

In [None]:
# Sort by reverse frequency order
TermFreq$feature <- with(TermFreq, reorder(feature, -frequency))

In [None]:
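# Plotting the x most common tokens (you can change x to suit your needs)
x <- 25

# selecting the first x rows explicitly (TermFreq[1:x] would select columns if TermFreq is a data.frame)
ggplot(TermFreq[1:x, ], aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))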

This overview might be biased because it just counts the total number of occurrences of each token across all comments. A single person spamming the same word hundreds of times in one comment could therefore dominate the picture. To mitigate this, we will also count the number of comments that contain each token at least once.

In [None]:
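# Plotting the x tokens that are used in the highest number of comments (you can change x to suit your needs)
x <- 25

# sort by the number of comments containing each token (docfreq) before selecting the top x
TermFreq <- TermFreq[order(-TermFreq$docfreq), ]
TermFreq$feature <- with(TermFreq, reorder(feature, -docfreq))

ggplot(TermFreq[1:x, ], aes(x = feature, y = docfreq)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))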

### Custom Stopwords

After inspecting the most frequent terms, we might want to exclude certain terms that are not informative about the comments (e.g. the word "video") or certain words that are used by spammers (e.g. "viagra"). Which words to exclude is the individual decision of each researcher. You should be careful when designing this list for your needs and be transparent about which words you excluded and why.

In [None]:
# Custom Stopword List
CustomStops <- c("video","wow","million","much","oh","zouhir","bahoui")

# This is just an example, you should carefully create your own list

In [None]:
# We can create another document-frequency matrix that excludes the Custom Stopwords that we just defined, and then rerun the code above to update our results
commentsDfm <- dfm(toks, remove = c(quanteda::stopwords("english"),CustomStops))

In [None]:
# rerunning steps from above in one cell with new DFM (excluding custom stop words)
TermFreq <- textstat_frequency(commentsDfm)
TermFreq$feature <- with(TermFreq, reorder(feature, -frequency))
x <- 25
ggplot(TermFreq[1:x, ], aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

### Word Cloud

You can also use the DFM to easily create a word cloud that visualizes the most frequent tokens.

In [None]:
# Wordcloud for most frequently used Terms
set.seed(12345)
textplot_wordcloud(dfm_select(commentsDfm, min_nchar=3),
                   random_order=FALSE,
                   max_words=100)

# Sentiment Analysis of Text

We can use Sentiment Analysis (https://en.wikipedia.org/wiki/Sentiment_analysis) on the comments to get an intuition about people's opinions towards the content of the videos. Sentiment analysis works by comparing the tokens in each comment to a dictionary of words with an attached sentiment rating. For example, the word "fuck" would have a negative sentiment rating in the dictionary while the word "love" would have a positive sentiment rating. If we add up the sentiment scores of all tokens in a given comment, we get an overall sentiment score for that comment.

Of course, the results depend on the kind of dictionary that is used. We chose the AFINN dictionary (http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), because it is
based on the language in microblogs and thus might capture the slang/tone of online comments better than other dictionaries.

However, the choice of the dictionary is up to the individual researcher.
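
The basic mechanism of dictionary-based sentiment scoring can be sketched in a few lines of base R. Note that this is a toy illustration under stated assumptions: `toy_lexicon` and its scores are made up and are NOT the AFINN values, and `score_comment()` is a hypothetical helper, not the function used by the syuzhet package.

```r
# Made-up mini dictionary (NOT the real AFINN scores)
toy_lexicon <- c(love = 3, great = 3, bad = -3, hate = -3)

# Hypothetical helper: sum the scores of all words found in the dictionary.
# Words that are not in the dictionary contribute nothing to the sum.
score_comment <- function(comment, lexicon) {
  words <- strsplit(tolower(comment), "[^a-z']+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

toy_comments <- c("I love this great song",
                  "I hate it, bad video",
                  "nothing to see here")

toy_scores <- vapply(toy_comments, score_comment, numeric(1),
                     lexicon = toy_lexicon, USE.NAMES = FALSE)
toy_scores
```

A comment without any dictionary words gets a score of 0, which is why neutral scores tend to be very common in this kind of analysis.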

In [None]:
# getting sentiment scores per comment
CommentSentiment <- syuzhet::get_sentiment(FormattedComments$TextEmojiDeleted, method = "afinn")

Let's get some summary statistics and a basic visualization of sentiment across all comments.

In [None]:
# Basic statistics
summary(CommentSentiment)
boxplot(CommentSentiment)

We can also manually inspect comments that have extreme ratings, as it's always good to check outliers.

In [None]:
# displaying comments with a sentiment score below x
x  <- -3
FormattedComments$TextEmojiDeleted[CommentSentiment < x]

In [None]:
# displaying comments with a sentiment score above x
x <- 5
FormattedComments$TextEmojiDeleted[CommentSentiment > x]

In [None]:
# displaying most negative/positive comment
FormattedComments$TextEmojiDeleted[CommentSentiment == min(CommentSentiment)]
FormattedComments$TextEmojiDeleted[CommentSentiment == max(CommentSentiment)]

### Visualizing Comment Sentiment

To get a better overview, we can also display the total amount of positive, negative and neutral comments. To this end,
we need to create a new dataframe as a crutch to categorize the comments first.

We can also display the total distribution of sentiment in comments and overlay the mean.

In [None]:
# building helper Frame
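# categorizing on the numeric scores directly; assigning labels step by step would turn
# the vector into character strings and make the later comparisons unreliable
Desc <- ifelse(CommentSentiment > 0, "positive",
               ifelse(CommentSentiment < 0, "negative", "neutral"))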
df <- data.frame(FormattedComments$TextEmojiDeleted,CommentSentiment,Desc)
colnames(df) <- c("Comment","Sentiment","Desc")

In [None]:
# displaying the amount of positive, negative and neutral comments
ggplot(data=df, aes(x=Desc, fill = Desc)) +
        geom_bar(stat='count')

In [None]:
# distribution of comment sentiments
ggplot(df, aes(x=Sentiment)) +
        geom_histogram(binwidth = 1) +
        geom_vline(aes(xintercept=mean(Sentiment)),
           color="black", linetype="dashed", size=1)

# Emoji Analysis

So far, we have only analyzed the text of the comments, but we can also analyze the Emojis used in the comments.
To this end, we first format NA values correctly for comments that do not contain any Emojis.
Second, we tokenize the Emojis just as we did with the text strings. Then, we build an Emoji frequency matrix
that counts how often each Emoji occurs in every comment. Lastly, we visualize our results.

In [None]:
# Formatting NAs correctly
FormattedComments$Emoji[FormattedComments$Emoji == "NA"] <- NA

In [None]:
# removing spaces at the end of the string
FormattedComments$Emoji <- substr(FormattedComments$Emoji, 1, nchar(FormattedComments$Emoji)-1)

In [None]:
# tokenizing
EmojiToks <- tokens(FormattedComments$Emoji)

In [None]:
# Displaying the Emojis in the first 10 comments
EmojiToks[1:10]

In [None]:
# We build an Emoji Frequency Matrix, excluding "NA" as a term
EmojiDfm <- dfm(EmojiToks,remove = "NA")

In [None]:
# We can display the frequency of Emojis in the documents
EmojiFreq <- textstat_frequency(EmojiDfm)
EmojiFreq

In [None]:
# We can also get a more compact overview of the x top Emojis in the comments
x <- 20
topfeatures(EmojiDfm,x)

### Visualizing by total occurrences

Using the Emoji frequency matrix, we can plot the most frequently occurring Emojis across all comments.

In [None]:
# Sort by reverse frequency order
EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -frequency))

In [None]:
# Plotting
ggplot(EmojiFreq, aes(x = feature, y = frequency)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

### Visualizing by number of comments containing the Emoji at least once

This overview might be biased because it just counts the total number of occurrences of each Emoji across all comments. A single person spamming the same Emoji hundreds of times in one comment could therefore dominate the picture. To mitigate this, we will also count the number of comments that contain each Emoji at least once.

In [None]:
# sort by reverse document frequency order
EmojiFreq$feature <- with(EmojiFreq, reorder(feature, -docfreq))

# plotting
ggplot(EmojiFreq, aes(x = feature, y = docfreq)) +
        geom_point() + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Sentiment Analysis for Emojis (experimental)

Just like text (and arguably even more so), Emojis are used to convey emotions and opinions. For this reason, we attempt an exploratory sentiment analysis using the Emojis in the comments. Just as for the text, we need a dictionary that maps Emojis to sentiments. Unfortunately, so far there seems to be only one sentiment dictionary, covering the 734 most commonly used Emojis (http://kt.ijs.si/data/Emoji_sentiment_ranking/). Thus, we currently cannot check the results against different dictionaries and cannot include all Emojis in this analysis.

In [None]:
# importing the Emoji dictionary (we only get 734 different Emojis, but that's the best data we have on Emoji sentiment)
EmojiSentiments <- lexicon::emojis_sentiment

In [None]:
# displaying the first ten rows of the Emoji Sentiment dictionary
EmojiSentiments[1:10,]

In [None]:
# we have to match the sentiment scores to our codings of the emojis and create a quanteda dictionary object
EmojiNames <- paste0("emoji_",gsub(" ","",EmojiSentiments$name))
EmojiSentiment <- cbind.data.frame(EmojiNames,EmojiSentiments$sentiment,EmojiSentiments$polarity)
names(EmojiSentiment) <- c("word","sentiment","valence")
EmojiSentDict <- as.dictionary(EmojiSentiment[,1:2])

In [None]:
# tokenizing the Emoji-only column in our formatted dataframe
EmojiToks <- tokens(tolower(FormattedComments$Emoji))

In [None]:
# We can now replace the Emojis in our tokens with the corresponding sentiment scores from the dictionary
EmojiToksSent <- tokens_replace(EmojiToks,EmojiSentDict)

In [None]:
# checking how many Emoji we can cover with sentiment scores
A <- unlist(EmojiToksSent)
names(A) <- NULL

After mapping the Emojis in our dataset to the sentiment scores in the dictionary, we can check how many Emojis we have in total in our dataset and how many of those we are getting sentiment mappings for.

In [None]:
# total Emoji
B <- A[A!="NA"]
length(B)

In [None]:
## sanity check

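# Number of Emoji that couldn't be replaced
length(grep("emoji_",B))

# number of Emoji that could be replaced (tokens that now start with a numeric sentiment score;
# the pattern "0." would also match any token containing a 0 followed by another character)
length(grep("^-?[0-9]",B))

# Share of Emoji that couldn't be replaced
length(grep("emoji_",B))/length(B)

# Share of Emoji that could be replaced
length(grep("^-?[0-9]",B))/length(B)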

We have to add up the sentiment scores of all Emojis within the same comment to get an overall comment sentiment based on Emojis.

In [None]:
# Computing sentiment scores for comments based on Emoji

# only keeping the replaced sentiment scores for the Emoji vector
D <- tokens_select(EmojiToksSent,EmojiSentiment$sentiment,"keep")
D <- as.list(D)

# function to add sentiment scores of Emojis per comment
AddEmojiSentiments <- function(x){
        
        x <- sum(as.numeric(as.character(x)))
        return(x)
        
}


AdditiveEmojiSentiment <- lapply(D,AddEmojiSentiments)
AdditiveEmojiSentiment[AdditiveEmojiSentiment == 0] <- NA
AdditiveEmojiSentiment <- unlist(AdditiveEmojiSentiment)
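
The logic of this step can also be shown on a toy example in base R. This is a sketch under stated assumptions: the Emoji names and scores in `toy_emoji_scores` are made up and NOT taken from the real dictionary, and `score_emoji()` is a hypothetical helper. Unlike the code above, it returns NA only when none of the Emojis in a comment could be matched, so that scores which genuinely add up to zero are kept.

```r
# Made-up Emoji scores (NOT taken from the real dictionary)
toy_emoji_scores <- c(emoji_redheart = 0.7,
                      emoji_facewithtearsofjoy = 0.2,
                      emoji_cryingface = -0.1)

# Each list element holds the Emoji tokens of one comment
toy_emoji_comments <- list(c("emoji_redheart", "emoji_redheart"),
                           c("emoji_cryingface", "emoji_unknownemoji"),
                           character(0))

# Hypothetical helper: sum the scores of all Emojis found in the dictionary,
# returning NA if no Emoji of the comment could be matched
score_emoji <- function(tokens, scores) {
  matched <- scores[tokens]
  matched <- matched[!is.na(matched)]
  if (length(matched) == 0) NA_real_ else sum(matched)
}

toy_emoji_sentiment <- vapply(toy_emoji_comments, score_emoji, numeric(1),
                              scores = toy_emoji_scores)

# Share of Emoji tokens that are covered by the dictionary
all_emoji_tokens <- unlist(toy_emoji_comments)
coverage <- mean(all_emoji_tokens %in% names(toy_emoji_scores))

toy_emoji_sentiment
coverage
```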

We can now visualize the results

In [None]:
# plotting histogram for distribution of Emoji Sentiment Scores
hist(AdditiveEmojiSentiment)

In [None]:
## correlation between Emoji sentiment score and text Sentiment Score
cor(CommentSentiment,AdditiveEmojiSentiment,use="complete.obs")

In [None]:
## plotting the relationship
plot(CommentSentiment,AdditiveEmojiSentiment)