In [1]:
# Load libraries ----------------------------------------------------------------------------------------------------------------------------------------------
library(tidyverse) # metapackage with lots of helpful functions
library(tidytext)


# list.files(path = "../input")



── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.3.4     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.1.2     [32m✔[39m [34mdplyr  [39m 1.0.7
[32m✔[39m [34mtidyr  [39m 1.1.3     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 1.4.0     [32m✔[39m [34mforcats[39m 0.5.1

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()



There are three .csv files in the directory structure:

In [2]:
directory_content = list.files("../input/bda2021big5/youtube-personality", full.names = TRUE)
# print(directory_content)

In addition there's a "transcript" folder (see number \[2\] in the output above) in which the actual video transcripts are stored in `.txt` files. 

Store these file paths in variables for easy reference later on:

In [3]:
# Path to the transcripts directory with transcript .txt files
path_to_transcripts = directory_content[2] 

# .csv filenames (see output above)
AudioVisual_file    = directory_content[3]
Gender_file         = directory_content[4]
Personality_file    = directory_content[5]

# 1. Import the data

We'll import: Transcripts, Personality scores, Gender

## 1.1 Importing transcripts

The transcript text files are stored in the subfolder 'transcripts'. They can be listed with the following commands:

In [4]:
transcript_files = list.files(path_to_transcripts, full.names = TRUE) 

# print(head(transcript_files))


The transcript file names encode the vlogger ID that you will need for joining information from the different data frames. A clean way to extract the vlogger ID's from the names is by using the funcation `basename()` and removing the file extension ".txt".

In [5]:
vlogId = basename(transcript_files)
vlogId = str_replace(vlogId, pattern = ".txt$", replacement = "")
# head(vlogId)

To include features extracted from the transcript texts you will have to read the text from files and store them in a data frame. For this, you will need the full file paths as stored in `transcript_files`.

Here are some tips to do that programmatically

- use either a `for` loop, the `sapply()` function, or the `map_chr()` from the `tidyverse`
- don't forget to also store `vlogId` extracted with the code above 

We will use the `map_chr()` function here:

In [6]:
transcripts_df = tibble(
    
    # vlogId connects each transcripts to a vlogger
    vlogId=vlogId,
    
    # Read the transcript text from all file and store as a string
    Text = map_chr(transcript_files, ~ paste(readLines(.x), collapse = "\\n")), 
    
    # `filename` keeps track of the specific video transcript
    filename = transcript_files
)
# head(transcripts_df)


“incomplete final line found on '../input/bda2021big5/youtube-personality/transcripts/VLOG11.txt'”


In [7]:
transcripts_df %>% 
    head(1)


vlogId,Text,filename
<chr>,<chr>,<chr>
VLOG1,"You know what I see - - no, more like hear a lot these days, is people calling other people gay as an insult. Now what makes people come up with calling others gay? Now here's an example. Hey, hey, you wanna trade Pokemon or Ziegfield cards? Or, or, or we can play, we can play superheroes. Oh, can I be Optimus Prime? Dude, you are so gay. Dude, the cool kids do crack. Oh, my mommy says, say no to drugs. Okay, how the hell does playing Pokemon cards or -- or --- or dancing or holding hands with another guy make me homosexual? I don't get these people. \nThis is how it is in my school. Okay, here's an example. All right, um, when they see two guys are gay, they're together, they're like no, ew, no. No, no that -- that doesn't go together - - you know, two guys, no. two sticks, no. It just doesn't work like . But when they see two girls, they're like, get it on. And I don't get these people. I've never seen someone say like, oh, you're so homosexual or you're so lesbian or you're such a child molester. It is always the word gay, cause apparently gay is now an insult, even though the word means like happy and lively and that kinda giddy feeling you have inside, like -- -- but no you have to turn that happy word into a mean word. Apparently, we can do that now, turning good things into bad things. It's like how Spiderman felt good, but then that -- that -- that grease that gets all over him and then and then evil Dr. Octopus. That's so gay, you like Spiderman. Lar, I'm going to the movies with the guys to watch Mama Mia. \nYou never know if other people are offended by what you say. I'm not saying you're a bad person if you do it. I used to do it all the time. I'm more focused on why we say it. In the end, we're all the same. You know, there's nothing wrong with it. I was just wondering where it all came from, you know. All right, thanks a lot for watching. Oh, yeah and the club channel is up and running. So, make sure to check that out because there's gonna be a lot of cool stuff on there. We'll do up to like four challenges at a time. We'll do contests, dares, questions. In the end, there's gonna be a lot of viewer interactions, so it's gonna be really fun. We may even put other people on the video too. So check it.",../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt



## Import personality scores

In [8]:
# Import the Personality scores
pers_df = read_delim(Personality_file, delim=" ")
head(pers_df)


[36m──[39m [1m[1mColumn specification[1m[22m [36m────────────────────────────────────────────────────────[39m
cols(
  vlogId = [31mcol_character()[39m,
  Extr = [32mcol_double()[39m,
  Agr = [32mcol_double()[39m,
  Cons = [32mcol_double()[39m,
  Emot = [32mcol_double()[39m,
  Open = [32mcol_double()[39m
)




vlogId,Extr,Agr,Cons,Emot,Open
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
VLOG1,4.9,3.7,3.6,3.2,5.5
VLOG3,5.0,5.0,4.6,5.3,4.4
VLOG5,5.9,5.3,5.3,5.8,5.5
VLOG6,5.4,4.8,4.4,4.8,5.7
VLOG7,4.7,5.1,4.4,5.1,4.7
VLOG9,5.6,5.0,4.0,4.2,4.9


## Import gender

Gender info is stored in a separate `.csv` which is also delimited with a space. This file doesn't have column names, so we have to add them ourselves:

In [9]:
gender_df = read.delim(Gender_file, head = FALSE, sep= " ", skip = 2)

# Add column names
names(gender_df) = c('vlogId', 'gender')
# head(gender_df)

## Merging the `gender` and `pers` dataframes

Obviously, we want all the information in a single tidy data frame. While the builtin R function `merge()` can do that, the `tidyverse()` has a number of more versatile and consistent functions called `left_join`, `right_join`, `inner_join`, `outer_join`, and `anti_join`. We'll use `left_join` here to merge the gender and personality data frames:

In [10]:
# using left-join therefore any pers that do not have a corresponding gender score will be deleted 

vlogger_df <- left_join(gender_df, pers_df)
# head(vlogger_df) 


Joining, by = "vlogId"



Note that some rows, like row 5, has `NA`'s for the personality scores. This is because this row corresponds to the vlogger with vlogId `VLOG8` is part of the test set. You still have to split `vlogger_df` into the training and test set, as shown below.

We leave the `transcripts_df` data frame seperate for now, because you will first have to extract features from the transcripts first. Once you have those features in a tidy data frame, including a `vlogId` column, you can refer to this `left_join` example to merge your features with `vlogger_df` in one single tidy data frame.

# 2. Feature extraction from transcript texts
### Tokenizing the text
We have "built" 3 features ourselves, used one new word-list after exploring the literature and used two of the word-lists already given to us. In total, we have 6 predictors. 


In [11]:
# splitting all the sentences into tokens (words)

transcripts_unnest <- 
transcripts_df %>%
  unnest_tokens(word, Text, token = 'words')

tail(transcripts_unnest) # check whether tokenizing worked

vlogId,filename,word
<chr>,<chr>,<chr>
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,my
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,life
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,i'll
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,keep
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,you
VLOG99,../input/bda2021big5/youtube-personality/transcripts/VLOG99.txt,updated


In [12]:
# removing filename column 
head(transcripts_unnest) # still has file name
transcripts_unnest <- transcripts_unnest[, -2] # remove filename

vlogId,filename,word
<chr>,<chr>,<chr>
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,you
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,know
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,what
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,i
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,see
VLOG1,../input/bda2021big5/youtube-personality/transcripts/VLOG1.txt,no


## Feature 1: Counting number of word occurences 
It is hypothesized that the number of words a vlogger uses predicts the vloggers personality. 
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791


In [13]:
# count word occurences 

word_count <-   
  transcripts_unnest %>% count(vlogId) %>%  # count tokens
  rename(n_words = n) # changing column names

head(word_count) # check the feature word count


vlogId,n_words
<chr>,<int>
VLOG1,435
VLOG10,449
VLOG100,293
VLOG102,1346
VLOG103,769
VLOG104,788


## Feature 2: Counting average word length 
It is hypothesized that the average length of words vloggers use predicts their personality. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791

In [14]:
# feature extraction: counting average length of word 

length <- nchar(transcripts_unnest$word) # counting number of characters in each word

word_length <- cbind(transcripts_unnest[, 1], length) # matrix with word_length included 
 
# calculate average word length 
word_length <- 
    word_length[, 2] %>%
    aggregate(list(word_length$vlogId), mean) %>% # calculating average word-length for each group 
  rename(word_length = x, vlogId = Group.1) # changing the column names so we can join them to another table by vlogId later 
head(word_length)

Unnamed: 0_level_0,vlogId,word_length
Unnamed: 0_level_1,<chr>,<dbl>
1,VLOG1,3.903448
2,VLOG10,4.552339
3,VLOG100,3.767918
4,VLOG102,3.971768
5,VLOG103,3.721717
6,VLOG104,4.317259


## Empathy and Distress
These wordlists are found to relate to personality. 
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791

In [15]:
# load empathy and the stress word lists
empathy_data = read.table("../input/empathy-lexicon/empathy_lexicon.txt", header = TRUE, sep = ',')
distress_data = read.table("../input/distress-lexicon/distress_lexicon.txt", header = TRUE, sep = ',')

head(empathy_data)
head(distress_data)

Unnamed: 0_level_0,word,rating
Unnamed: 0_level_1,<chr>,<dbl>
1,helps,4.315954
2,uncommon,2.534964
3,blank,3.559863
4,iraqis,5.446981
5,explored,4.401998
6,concentrate,3.637599


Unnamed: 0_level_0,word,rating
Unnamed: 0_level_1,<chr>,<dbl>
1,helps,2.409573
2,uncommon,1.303017
3,blank,3.729931
4,iraqis,5.585402
5,explored,3.672773
6,concentrate,2.681559


## Feature 3: Empathy 

In [16]:
# Do an inner join an essay token data frame and empathy word list
empathy <- 
    inner_join(transcripts_unnest, empathy_data, by = 'word') 

#count the empathy score

empathy <- 
    empathy[, 3] %>%
    aggregate(list(empathy$vlogId), mean) 


empathy  <- empathy  %>% # changing the column names so we can join them to another table by vlogId later 
  rename(empathy = rating, vlogId = Group.1)

head(empathy) # check the feature


Unnamed: 0_level_0,vlogId,empathy
Unnamed: 0_level_1,<chr>,<dbl>
1,VLOG1,3.060428
2,VLOG10,3.261867
3,VLOG100,3.332329
4,VLOG102,3.123552
5,VLOG103,3.130718
6,VLOG104,3.166712


## Feature 4:  Distress

In [17]:
# Do an inner join an essay token data frame and distress word list
distress = 
    inner_join(transcripts_unnest, distress_data, by = 'word') 

#count the distress score

distress <- 
    distress[, 3] %>%
    aggregate(list(distress$vlogId), mean) 

distress  <- distress  %>% # changing the column names so we can join them to another table by vlogId later 
  rename(distress = rating, vlogId = Group.1)

head(distress) #check the feature

Unnamed: 0_level_0,vlogId,distress
Unnamed: 0_level_1,<chr>,<dbl>
1,VLOG1,2.920563
2,VLOG10,3.135499
3,VLOG100,2.980318
4,VLOG102,2.980737
5,VLOG103,2.970042
6,VLOG104,2.914677


## Feature 5: Adjectives

In [18]:
adjective_data = read.table("../input/adjectives-data/adjective_list.txt", header = TRUE, sep = '')


adjective_df <- inner_join(transcripts_unnest, adjective_data, by = 'word')  


# Grouping adjectives use by vlogId
adjective_df <- adjective_df %>% 
group_by(vlogId) %>%
count(word)

# summing total number of adjectives used per vlogger 
adjectives <- 
    adjective_df[, 3] %>%
    aggregate(list(adjective_df$vlogId), sum) %>% # changing the column names so we can join them to another table by vlogId later 
  rename(total_adjective = n, vlogId = Group.1)

head(adjectives)

Unnamed: 0_level_0,vlogId,total_adjective
Unnamed: 0_level_1,<chr>,<int>
1,VLOG1,44
2,VLOG10,50
3,VLOG100,21
4,VLOG102,106
5,VLOG103,66
6,VLOG104,79


## Feature 6: Affin 

In [19]:

# FEATURE EXTRACTION: Afinn 

# Import AFINN

download.file("http://www2.imm.dtu.dk/pubdb/edoc/imm6010.zip","afinn.zip")
unzip("afinn.zip")
afinn <- read.delim("AFINN/AFINN-111.txt", sep="\t", col.names = c("word","score"), stringsAsFactors = FALSE)

# Do an inner join an essay token data fame and afinn word list

transcripts_affin <- 
    inner_join(transcripts_unnest, afinn, by = 'word') 

# grouping affin score by vlogId
transcripts_affin <- transcripts_affin %>% 
group_by(vlogId) %>%
count(word)

# summing total total affin score per vlogger 
affin <- 
    transcripts_affin$n %>%
    aggregate(list(transcripts_affin$vlogId), sum) %>% # changing the column names so we can join them to another table by vlogId later 
  rename(affin = x, vlogId = Group.1)

# unlist to make data appropriate for the regression analysis 
affin <- affin %>% mutate(affin = unlist(affin)) 
head(affin) 

Unnamed: 0_level_0,vlogId,affin
Unnamed: 0_level_1,<chr>,<int>
1,VLOG1,40
2,VLOG10,29
3,VLOG100,21
4,VLOG102,68
5,VLOG103,38
6,VLOG104,28


## Feature 7: NCR lexicon


In [20]:
####NRC OUTPUT

#Feature extraction load NRC word list
load_nrc = function() {
    if (!file.exists('nrc.txt'))
        download.file("https://www.dropbox.com/s/yo5o476zk8j5ujg/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt?dl=1","nrc.txt")
    nrc = read.table("nrc.txt", col.names=c('word','sentiment','applies'), stringsAsFactors = FALSE)
    nrc %>% filter(applies==1) %>% select(-applies)
}

nrc = load_nrc()
sample_n(nrc, 6)

nrc = load_nrc()

# Do an inner join an essay token data fame and nrc word list
transcript_ncr = 
    inner_join(transcripts_unnest, nrc, by = c(word = 'word')) 

# Peek at the result
head(transcript_ncr, 6)
dim(transcript_ncr)

#count the sentiment
transcript_sentiment_scores = 
    transcript_ncr %>%
    count(`vlogId`, sentiment) 

# Peek at the result
dim(transcript_sentiment_scores)
head(transcript_sentiment_scores)


#widen the table
sentiment = 
    transcript_sentiment_scores %>%
    spread(sentiment, n, fill = 0)

head(sentiment)

word,sentiment
<chr>,<chr>
radiant,joy
treason,fear
publicist,negative
assignee,trust
ineffable,positive
gorgeous,positive


vlogId,word,sentiment
<chr>,<chr>,<chr>
VLOG1,insult,anger
VLOG1,insult,disgust
VLOG1,insult,negative
VLOG1,insult,sadness
VLOG1,insult,surprise
VLOG1,trade,trust


vlogId,sentiment,n
<chr>,<chr>,<int>
VLOG1,anger,8
VLOG1,anticipation,10
VLOG1,disgust,8
VLOG1,fear,6
VLOG1,joy,7
VLOG1,negative,10


vlogId,anger,anticipation,disgust,fear,joy,negative,positive,sadness,surprise,trust
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
VLOG1,8,10,8,6,7,10,14,8,5,11
VLOG10,10,19,7,14,9,18,19,8,10,17
VLOG100,0,9,0,2,8,3,11,1,2,9
VLOG102,4,50,7,14,29,14,42,8,22,24
VLOG103,0,19,4,9,13,15,26,10,5,11
VLOG104,2,15,0,4,49,3,62,22,9,37


# Combining our features into one table

In [21]:
# combining all features to one table 
transcript_features_df <- vlogger_df %>%
    inner_join(word_count, by = "vlogId") %>%
    inner_join(word_length, by = "vlogId") %>%
    inner_join(empathy, by = "vlogId") %>%
    inner_join(distress, by = "vlogId") %>%
    inner_join(adjectives, by = "vlogId") %>%
    inner_join(affin, by = "vlogId") %>%
    inner_join(sentiment, by = "vlogId")

head(transcript_features_df)




Unnamed: 0_level_0,vlogId,gender,Extr,Agr,Cons,Emot,Open,n_words,word_length,empathy,⋯,anger,anticipation,disgust,fear,joy,negative,positive,sadness,surprise,trust
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG3,Female,5.0,5.0,4.6,5.3,4.4,375,4.130667,3.22287,⋯,1,11,2,2,7,2,10,4,6,9
2,VLOG5,Male,5.9,5.3,5.3,5.8,5.5,395,3.944304,3.174061,⋯,1,6,1,2,4,1,10,1,4,5
3,VLOG6,Male,5.4,4.8,4.4,4.8,5.7,622,4.146302,3.116102,⋯,5,11,4,4,19,12,29,3,11,24
4,VLOG7,Male,4.7,5.1,4.4,5.1,4.7,644,4.099379,3.134713,⋯,12,22,11,7,12,15,21,11,9,16
5,VLOG8,Female,,,,,,311,3.729904,3.159109,⋯,3,12,2,6,7,5,12,4,3,9
6,VLOG9,Female,5.6,5.0,4.0,4.2,4.9,873,4.019473,3.169586,⋯,9,17,8,8,15,18,30,5,11,21


# 3. Predictive model
We chose a multiple linear regression anaylsis because we want to predict a continuous outcome variable. First we tried the full model with all of our features. 

In [22]:
# Separating training and test data: 

# Filter out the missing values to get our training data 
training_data <- transcript_features_df %>%
    filter(!is.na(Extr))

# head(is.na(training_data)) # check whether the filter worked

# select test-data  
testset_vloggers <- transcript_features_df %>% 
    filter(is.na(Extr))

# check the filter
head(is.na(testset_vloggers))

# Full model -------------------------------------------------------------------------------------------------------------------------
# predict personality with our features
fit_mlm <- lm(cbind(Extr, Agr, Cons, Emot, Open) ~ n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_data)

summary(fit_mlm) # look at the significance and check model fit
sqrt(mean(resid(fit_mlm) ^ 2)) # RMSE


vlogId,gender,Extr,Agr,Cons,Emot,Open,n_words,word_length,empathy,⋯,anger,anticipation,disgust,fear,joy,negative,positive,sadness,surprise,trust
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False
False,False,True,True,True,True,True,False,False,False,⋯,False,False,False,False,False,False,False,False,False,False


Response Extr :

Call:
lm(formula = Extr ~ n_words + word_length + empathy + distress + 
    total_adjective + affin + anger + anticipation + fear + joy + 
    negative + positive + sadness + surprise + disgust + trust, 
    data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.58951 -0.70097  0.07169  0.70424  2.10534 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)     -1.9225791  2.3245612  -0.827  0.40884   
n_words         -0.0002287  0.0005854  -0.391  0.69636   
word_length      0.3385940  0.3215940   1.053  0.29324   
empathy         -1.1707475  0.9854369  -1.188  0.23574   
distress         2.9412012  0.9630783   3.054  0.00246 **
total_adjective  0.0006348  0.0052245   0.121  0.90338   
affin            0.0068707  0.0045477   1.511  0.13187   
anger            0.0213527  0.0218924   0.975  0.33016   
anticipation    -0.0102445  0.0129634  -0.790  0.42999   
fear            -0.0137038  0.0167566  -0.818  0.41410

### Predictor Selection: 
We wanted to assess which features predict well for each personality measure. We chose to do this with a forward stepwise regression. Once the optimal models were chosen, we formed 


In [23]:
# Extr -------------------------------------------------------------------------------------------------------------------------------
# empty model 
fit_empty_extr <- lm(Extr ~ 1, data = training_data)
summary(fit_empty_extr)

# stepwise 
fit_extr <- step(fit_empty_extr, direction = "forward", scope = Extr ~n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_set, trace = 0)
summary(fit_extr)

#RMSE of Extraversion
sqrt(mean(resid(fit_extr) ^ 2))

# Agr --------------------------------------------------------------------------------------------------------------------------------
fit_empty_agr <- lm(Agr ~ 1, data = training_data)
summary(fit_empty_agr)

# stepwise 
fit_agr <- step(fit_empty_agr, direction = "forward", scope = Agr ~n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_set, trace = 0)
summary(fit_agr)

#RMSE of Extraversion
sqrt(mean(resid(fit_agr) ^ 2))

# Cons--------------------------------------------------------------------------------------------------------------------------------
fit_empty_cons <- lm(Cons ~ 1, data = training_data)
summary(fit_empty_cons)

# stepwise 
fit_cons <- step(fit_empty_agr, direction = "forward", scope = Cons ~ n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_set, trace = 0)
summary(fit_cons)

#RMSE of Extraversion
sqrt(mean(resid(fit_cons) ^ 2))

# Emot -------------------------------------------------------------------------------------------------------------------------------
# empty model 
fit_empty_emot <- lm(Emot~ 1, data = training_data)
summary(fit_empty_emot)

# stepwise 
fit_emot <- step(fit_empty_emot, direction = "forward", scope = Emot ~n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_set, trace = 0)
summary(fit_emot)

#RMSE of Openess 
sqrt(mean(resid(fit_emot) ^ 2))

# Open -------------------------------------------------------------------------------------------------------------------------------
# empty model 
fit_empty_open <- lm(Open ~ 1, data = training_data)
summary(fit_empty_open)

# stepwise 
fit_open <- step(fit_empty_open, direction = "forward", scope = Open ~ n_words + word_length + empathy +
             distress + total_adjective + affin + anger + anticipation + fear + 
            joy + negative + positive + sadness + surprise + disgust + trust, data = training_set, trace = 0)
summary(fit_open)

#RMSE of Openess 
sqrt(mean(resid(fit_open) ^ 2))




Call:
lm(formula = Extr ~ 1, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.62079 -0.72079  0.07921  0.77921  1.97921 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.62079    0.05383   85.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9674 on 322 degrees of freedom



Call:
lm(formula = Extr ~ distress + affin + fear, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.59718 -0.69036  0.02323  0.71732  2.02077 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.158350   1.800140  -1.199 0.231422    
distress     2.229566   0.602951   3.698 0.000256 ***
affin        0.006307   0.002221   2.839 0.004809 ** 
fear        -0.012895   0.008344  -1.546 0.123210    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9463 on 319 degrees of freedom
Multiple R-squared:  0.05208,	Adjusted R-squared:  0.04317 
F-statistic: 5.842 on 3 and 319 DF,  p-value: 0.0006777



Call:
lm(formula = Agr ~ 1, data = training_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6966 -0.4966  0.2034  0.6034  1.8034 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.69659    0.04958   94.73   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8911 on 322 degrees of freedom



Call:
lm(formula = Agr ~ disgust + empathy + distress + word_length + 
    surprise + anger + fear + sadness, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61039 -0.47677  0.07787  0.49990  1.84859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07247    1.67167   0.043  0.96545    
disgust     -0.03878    0.01477  -2.625  0.00908 ** 
empathy      5.01923    0.73986   6.784 5.82e-11 ***
distress    -2.86868    0.72720  -3.945 9.85e-05 ***
word_length -0.62326    0.23408  -2.663  0.00815 ** 
surprise     0.03319    0.01020   3.255  0.00126 ** 
anger       -0.04504    0.01528  -2.947  0.00345 ** 
fear         0.03534    0.01229   2.876  0.00430 ** 
sadness     -0.02942    0.01364  -2.157  0.03178 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7441 on 314 degrees of freedom
Multiple R-squared:   0.32,	Adjusted R-squared:  0.3027 
F-statistic: 18.47 on 8 and 314 DF,


Call:
lm(formula = Cons ~ 1, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61494 -0.41494  0.08506  0.48506  1.68506 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.51494    0.04393   102.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7895 on 322 degrees of freedom



Call:
lm(formula = Agr ~ disgust + empathy + distress + word_length + 
    surprise + anger + fear + sadness, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.61039 -0.47677  0.07787  0.49990  1.84859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.07247    1.67167   0.043  0.96545    
disgust     -0.03878    0.01477  -2.625  0.00908 ** 
empathy      5.01923    0.73986   6.784 5.82e-11 ***
distress    -2.86868    0.72720  -3.945 9.85e-05 ***
word_length -0.62326    0.23408  -2.663  0.00815 ** 
surprise     0.03319    0.01020   3.255  0.00126 ** 
anger       -0.04504    0.01528  -2.947  0.00345 ** 
fear         0.03534    0.01229   2.876  0.00430 ** 
sadness     -0.02942    0.01364  -2.157  0.03178 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7441 on 314 degrees of freedom
Multiple R-squared:   0.32,	Adjusted R-squared:  0.3027 
F-statistic: 18.47 on 8 and 314 DF,


Call:
lm(formula = Emot ~ 1, data = training_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5802 -0.3802  0.1198  0.5198  1.7198 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.78022    0.04293   111.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7715 on 322 degrees of freedom



Call:
lm(formula = Emot ~ disgust + positive + empathy + distress + 
    sadness + affin + total_adjective, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.23392 -0.41237  0.05529  0.47242  2.02312 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.926520   1.516006   1.930 0.054453 .  
disgust         -0.015428   0.014163  -1.089 0.276823    
positive         0.009373   0.004607   2.034 0.042755 *  
empathy          2.839578   0.703713   4.035 6.86e-05 ***
distress        -2.369776   0.694417  -3.413 0.000727 ***
sadness         -0.022468   0.011443  -1.963 0.050472 .  
affin           -0.006653   0.002885  -2.306 0.021757 *  
total_adjective  0.003869   0.002560   1.511 0.131721    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7157 on 315 degrees of freedom
Multiple R-squared:  0.1582,	Adjusted R-squared:  0.1395 
F-statistic:  8.46 on 7 and 315 DF,  p-value


Call:
lm(formula = Open ~ 1, data = training_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2659 -0.4659  0.0008  0.5341  1.6341 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.66586    0.03946   118.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7092 on 322 degrees of freedom



Call:
lm(formula = Open ~ empathy + fear + joy + anticipation, data = training_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.25247 -0.48369  0.00258  0.48515  1.51196 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.844669   1.373313   1.343   0.1802  
empathy       0.902493   0.437129   2.065   0.0398 *
fear         -0.011812   0.005705  -2.070   0.0392 *
joy           0.018940   0.007486   2.530   0.0119 *
anticipation -0.012943   0.007593  -1.705   0.0892 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6951 on 318 degrees of freedom
Multiple R-squared:  0.05125,	Adjusted R-squared:  0.03932 
F-statistic: 4.295 on 4 and 318 DF,  p-value: 0.002129


# 4. Making predictions on the test set


## Predictions

In this section we make predictions from the models selected through the stepwise regression. 

In [24]:
# new predictions from these models 
pred_extr <- predict(fit_extr, new = testset_vloggers)
pred_open <- predict(fit_open, new = testset_vloggers)
pred_cons <- predict(fit_cons, new = testset_vloggers)
pred_agr <- predict(fit_agr, new = testset_vloggers)
pred_emot <- predict(fit_emot, new = testset_vloggers)

# combine all the prediction vectors into one data frame 
testset_final_pred  = testset_vloggers %>% 
    mutate(
        Extr = pred_extr, 
        Agr  = pred_agr,
        Cons = pred_cons,
        Emot = pred_emot,
        Open = pred_open
    ) %>%
    select(vlogId, Extr:Open)

head(testset_final_pred )



# convert to competition format: 400 rows 
testset_final_pred <- testset_final_pred %>%
    as_tibble() %>%
    pivot_longer(c(Extr, Agr, Cons, Emot, Open), names_to = 'personality', values_to = 'Expected') 
testset_final_pred <- testset_final_pred  %>%
    unite(Id, vlogId, personality)

head(testset_final_pred)





Unnamed: 0_level_0,vlogId,Extr,Agr,Cons,Emot,Open
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,VLOG8,4.569289,5.121752,5.121752,4.735066,4.602135
2,VLOG15,4.543831,4.416545,4.416545,4.697841,4.426627
3,VLOG18,4.892155,5.232366,5.232366,5.029184,4.880817
4,VLOG22,5.389967,4.629905,4.629905,4.783131,4.991315
5,VLOG28,4.552347,4.76018,4.76018,4.498248,4.502251
6,VLOG29,4.886755,5.08275,5.08275,4.913666,4.847211


Id,Expected
<chr>,<dbl>
VLOG8_Extr,4.569289
VLOG8_Agr,5.121752
VLOG8_Cons,5.121752
VLOG8_Emot,4.735066
VLOG8_Open,4.602135
VLOG15_Extr,4.543831


## 4.3 Writing predictions to file

In [25]:
# Write to csv file 
write.csv(testset_final_pred, "predictions.final.csv")

# Check if the file was written successfully.
list.files()


Once you have clicked the <span style="background-color:#000000;color:white;padding:3px;border-radius:10px;padding-left:6px;padding-right:6px;">⟳ Save Version&nbsp;&nbsp;|&nbsp;&nbsp;0</span> button at the top left, and select the "Save & Run All (Commit)" option, go to the Viewer. There you will find your "predictions.csv" under Output. You'll also see a button there that allows you to submit your predictions with one click.

## Best Model based on RMSE: 
Our best model looked as follows: the four predictors were n_words + word_length + empathy + distress