In [None]:
suppressMessages(library(ggplot2))
suppressMessages(library(readr))
suppressMessages(library(plyr))   # load plyr before dplyr so dplyr's verbs are not masked
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))
suppressMessages(library(tidytext))
suppressMessages(library(RColorBrewer))
suppressMessages(library(wordcloud2))
suppressMessages(library(stringr))

Reading the data from "songdata.csv" and the extended dataset from "extended.csv"

In [None]:
system("ls ../input", intern=TRUE)
song_data<-read.csv("../input/songlyrics/songdata.csv")
song_data_extended <- read.csv("../input/extended-song-data/extended.csv")

# A quick look at the data

In [None]:
str(song_data)
head(song_data)

The dataset contains 4 features and 57650 rows.
It has 44824 different song titles, sung by 643 different artists.
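
As a minimal sketch of how such counts can be obtained, the snippet below uses a small made-up data frame in place of `song_data`. The rows are hypothetical; only the column names `artist`, `song` and `text` come from the notebook.

```r
# Toy stand-in for song_data (hypothetical rows, column names taken from the notebook)
toy_songs <- data.frame(
  artist = c("A", "A", "B", "C"),
  song   = c("Hello", "Goodbye", "Hello", "Sunrise"),
  text   = c("la la", "na na na", "hey", "oh oh"),
  stringsAsFactors = FALSE
)

n_rows    <- nrow(toy_songs)                   # number of songs (rows)
n_titles  <- length(unique(toy_songs$song))    # distinct song titles
n_artists <- length(unique(toy_songs$artist))  # distinct artists

print(c(rows = n_rows, titles = n_titles, artists = n_artists))
```

Two artists share the title "Hello" here, which is why the number of distinct titles can be lower than the number of rows.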

In [None]:
song_data$text<-as.character(song_data$text)

# Extending the data: plot the release date of songs
To generate lyrics, we assumed that knowing the release date of each song would help. We therefore wrote a script that uses the Spotify API to extend our dataset with the release date of every song available on Spotify (~40k/55k). This showed that most of the songs in the dataset were released between 1990 and 2010.

In [None]:
# Count songs per release year; `main` sets the plot title (the original passed it as a legend entry)
barplot(table(cut(song_data_extended$date, breaks = seq(1940, 2020, by = 1))), main = "Number of songs by year", xlab="year of release", ylab="number of songs in dataset")
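
The claim about the 1990 to 2010 range can also be checked numerically. The following is only a sketch on hypothetical release years; it assumes, as the `cut()` call suggests, that the date column holds numeric years, and that songs not found on Spotify are stored as `NA`.

```r
# Hypothetical release years standing in for song_data_extended$date
years <- c(1975, 1988, 1992, 1995, 1999, 2003, 2003, 2008, 2012, NA)

# Songs per year; table() drops NA values by default
songs_per_year <- table(years)

# Share of dated songs released between 1990 and 2010 inclusive
dated <- years[!is.na(years)]
share_1990_2010 <- mean(dated >= 1990 & dated <= 2010)

print(songs_per_year)
print(share_1990_2010)
```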

# Artists word cloud: song count per artist
This is a fun, visual way to see which artists have the most songs

Number of artists in the dataset

In [None]:
print(length(unique(song_data$artist)))

In [None]:
song_grp <- data.frame(table(song_data$artist))
colnames(song_grp) <- c("artist", "song_cnt")
wordcloud2(song_grp[1:600,], size = .5)

We split the lyrics of each song into individual words (one row per word) using the unnest_tokens() function of the tidytext library.

In [None]:
tidy_lyrics <- song_data %>% unnest_tokens(word,text)
head(tidy_lyrics)
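
To make the "one row per word" shape concrete, here is a rough base-R approximation on made-up lyrics. It is not how `unnest_tokens()` is implemented, and its tokenization rules are simpler; the data frame `toy` and its contents are hypothetical.

```r
# Toy stand-in for song_data (hypothetical lyrics)
toy <- data.frame(
  artist = c("A", "B"),
  song   = c("First", "Second"),
  text   = c("Hello, hello world!", "Don't stop"),
  stringsAsFactors = FALSE
)

# Lowercase, then split on anything that is not a letter or an apostrophe
tokens <- strsplit(tolower(toy$text), "[^a-z']+")

# One row per (song, word), repeating artist and song for each token
tidy_toy <- data.frame(
  artist = rep(toy$artist, lengths(tokens)),
  song   = rep(toy$song, lengths(tokens)),
  word   = unlist(tokens),
  stringsAsFactors = FALSE
)
tidy_toy <- tidy_toy[tidy_toy$word != "", ]  # drop empty tokens, if any

print(tidy_toy)
```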

# Words wordCloud
Displaying in a word cloud the most used words in song titles


In [None]:
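# Tokenize the song titles (not the lyrics) before counting words
mostWordsInTitles <- song_data %>% dplyr::mutate(song = as.character(song)) %>% unnest_tokens(word, song) %>% filter(!word %in% stop_words$word) %>% dplyr::count(word,sort = TRUE) %>% top_n(500, n)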
wordcloud2(mostWordsInTitles %>% top_n(100, n))

Displaying in a word cloud the most used words in lyrics

In [None]:
mostWordsInLyrics <- tidy_lyrics %>% filter(!word %in% stop_words$word) %>% dplyr::count(word,sort = TRUE) %>% top_n(500, n)
wordcloud2(mostWordsInLyrics %>% top_n(100, n))
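
The counting step behind both word clouds boils down to removing stop words and ranking the remaining words by frequency. A minimal base-R sketch on made-up tokens, where `toy_stop_words` is a hypothetical stand-in for `stop_words$word`:

```r
# Hypothetical tokens and stop-word list
words <- c("love", "the", "love", "baby", "and", "love", "baby", "night")
toy_stop_words <- c("the", "and")

# Keep only words that are not stop words
kept <- words[!words %in% toy_stop_words]

# Frequency table sorted from most to least frequent, then the top 2 entries
freq <- sort(table(kept), decreasing = TRUE)
top2 <- head(freq, 2)

print(top2)
```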

# Statistics about lyrics length

In [None]:
musics <- data.frame(song_data$artist,song_data$song, song_data$text)
musics$textlength = str_count(song_data$text, '\\w+')
meanlyricslength <- ddply(musics, .(song_data.artist), summarize,  Rate1=mean(textlength))
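
For readers without plyr or stringr at hand, the same idea can be sketched in base R: count runs of word characters per song, then average per artist. The data below are made up, and `aggregate()` is swapped in for `ddply()`.

```r
# Toy stand-in for the musics data frame (hypothetical lyrics)
toy <- data.frame(
  artist = c("A", "A", "B"),
  text   = c("one two three", "one two three four five", "la la"),
  stringsAsFactors = FALSE
)

# Count runs of word characters, using the same pattern as the str_count() call
toy$textlength <- lengths(regmatches(toy$text, gregexpr("\\w+", toy$text)))

# Mean lyrics length per artist
mean_len <- aggregate(textlength ~ artist, data = toy, FUN = mean)

print(mean_len)
```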

Minimum mean of lyrics length of artists

In [None]:
print(min(meanlyricslength$Rate1))

Maximum mean of lyrics length of artists

In [None]:
print(max(meanlyricslength$Rate1))

Mean of the per-artist mean lyrics length

In [None]:
print(mean(meanlyricslength$Rate1))

Median of the per-artist mean lyrics length

In [None]:
print(median(meanlyricslength$Rate1))

Quartiles of the mean lyrics length of artists

In [None]:
print(quantile(meanlyricslength$Rate1))

Standard deviation of the mean lyrics length of artists

In [None]:
print(sd(meanlyricslength$Rate1))

We then plot the distribution of the mean lyrics length per artist.
The histogram looks roughly normal, with a mean of about 240 words and a standard deviation of about 93.

In [None]:
x <- meanlyricslength$Rate1
h<-hist(x, breaks=10, col="red", xlab="Mean number of words in lyrics",ylab = "Number of artists", main="Distribution of the mean of lyrics length per artist") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2)
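
Overlaying a normal curve is a visual check only. One possible way to probe normality further is a Shapiro-Wilk test and a Q-Q plot, sketched here on simulated data; `x_toy` is a hypothetical stand-in for `meanlyricslength$Rate1`, generated with the mean and standard deviation quoted above.

```r
set.seed(1)  # reproducible toy data

# Simulated stand-in for the 643 per-artist means
x_toy <- rnorm(643, mean = 240, sd = 93)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(x_toy)
print(sw$p.value)

# Q-Q plot: points close to the line suggest an approximately normal shape
qqnorm(x_toy)
qqline(x_toy, col = "blue")
```

On the real data, a right-skewed tail (a few artists with very long lyrics) would show up as points bending away from the line at the upper end.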

Count the number of words per song title (songs that share a title are counted together)

In [None]:
song_wrd_count<-  data.frame(table(tidy_lyrics$song))
colnames(song_wrd_count) <- c("song", "n")
head(song_wrd_count)

Plotting the songs with the most and the fewest words (top/bottom 10 songs).

In [None]:
# Top 10 songs by word count
song_wrd_count %>% arrange(desc(n)) %>% top_n(10, n) %>%
  ggplot(aes(x=factor(song,levels=song),y=n))+geom_col(fill="blue",size=1)+labs(x="song",y="word count",title="Words per song - Top 10")+coord_flip()
# Bottom 10 songs by word count
song_wrd_count %>% arrange(desc(n)) %>% tail(n=10) %>%
  ggplot(aes(x=factor(song,levels=song),y=n))+geom_col(fill="blue",size=1)+labs(x="song",y="word count",title="Words per song - Bottom 10")+theme(axis.text.x = element_text(angle=90))+coord_flip()

Display the 10 artists with the highest mean lyrics length

In [None]:
topten <- meanlyricslength %>% top_n(10, Rate1)
topten <- topten[order(topten$Rate1),]
barplot(topten$Rate1, legend = topten$song_data.artist, col = c("lightblue", "mistyrose", "lavender","indianred", "red", 
                                                           "deepskyblue", "lightskyblue", "palevioletred","skyblue2", "yellow"), border = "dark blue")
title(main = "Top 10 artists with biggest mean lyrics length")

Display the 10 artists with the lowest mean lyrics length

In [None]:
bottomten <- meanlyricslength %>% top_n(-10, Rate1)
bottomten <- bottomten[order(bottomten$Rate1),]
barplot(bottomten$Rate1, legend = bottomten$song_data.artist, col = c("lightblue", "mistyrose", "lavender","indianred", "red", 
                                                                 "deepskyblue", "lightskyblue", "palevioletred","skyblue2", "yellow"), border = "dark blue")
title(main = "10 artists with lowest mean lyrics length")

# NRC Lexicon

We first attach each song's total word count to "tidy_lyrics" with left_join. We then join "tidy_lyrics" with the "nrc" lexicon to get the sentiments associated with each word in the lyrics.
The NRC lexicon has 10 sentiment categories:
anger  anticipation  disgust  
fear  joy  negative  positive  
sadness  surprise  trust  

We decided to remove the categories positive, negative and anticipation because they do not describe a specific emotion, and we want to be precise when we categorise our data for the second part of the study.

## Plotting the top 5 words under each sentiment category

In [None]:
lyric_nb <- tidy_lyrics %>% left_join(song_wrd_count, by = "song")%>% dplyr::rename(total_words=n)
head(lyric_nb)

lyric_sentiment<-tidy_lyrics %>% inner_join(get_sentiments("nrc"),by="word")
lyric_sentiment %>% filter(!sentiment %in% c("positive","negative","anticipation")) %>% dplyr::count(word,sentiment,sort=TRUE)%>%group_by(sentiment)%>%top_n(n=5) %>%
  ggplot(aes(x=reorder(word,n),y=n,fill=sentiment))+geom_col(show.legend = TRUE)+facet_wrap(~sentiment,scales="free")+coord_flip()
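
The effect of the inner join is easier to see on a tiny example. The sketch below uses base `merge()` and a made-up lexicon (`toy_lexicon` is hypothetical and is not the real NRC data): words absent from the lexicon disappear, and a word with several sentiments produces one row per sentiment.

```r
# Hypothetical tokens and a tiny made-up lexicon
tokens <- data.frame(
  song = c("S1", "S1", "S1", "S2", "S2"),
  word = c("love", "happy", "the", "cry", "love"),
  stringsAsFactors = FALSE
)
toy_lexicon <- data.frame(
  word      = c("love", "love", "happy", "cry"),
  sentiment = c("joy", "positive", "joy", "sadness"),
  stringsAsFactors = FALSE
)

# Inner join on word: "the" drops out, "love" is duplicated once per sentiment
joined <- merge(tokens, toy_lexicon, by = "word")

# Drop the broad categories, then count each (word, sentiment) pair
kept <- joined[!joined$sentiment %in% c("positive", "negative", "anticipation"), ]
counts <- as.data.frame(table(word = kept$word, sentiment = kept$sentiment),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

print(counts[order(-counts$Freq), ])
```

Because one word can carry several sentiments, the joined table can have more rows than the token table had matching words, so sentiment counts are not word counts.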

Plotting the top 5 songs under each sentiment Category  

In [None]:
lyric_sentiment %>% filter(!sentiment %in% c("positive","negative","anticipation")) %>% dplyr::count(song,sentiment,sort=TRUE)%>%group_by(sentiment)%>%top_n(n=5)%>%ggplot(aes(x=reorder(song,n),y=n,fill=sentiment)) +
geom_bar(stat="identity",show.legend = FALSE)+facet_wrap(~sentiment,scales="free")+coord_flip()

List of the number of occurrences of each word per artist


In [None]:
lyrics_per_artist<-tidy_lyrics %>% filter(!word %in% stop_words$word) %>% dplyr::count(artist, word, sort=TRUE) %>% group_by(artist)
lyrics_per_artist <- lyrics_per_artist[order(lyrics_per_artist$artist),]
worldcloud_data = lyrics_per_artist[c("word","n")]
head(worldcloud_data)
wordcloud2(worldcloud_data[1:200,])

# Curse words analysis
We wanted to compare each artist's most used words against a list of words we created.
In this case we look for the artists who curse the most. We expected all of the top 50 or top 100 to be rappers.

In [None]:
# Count the curse words used by each artist. Most of the top cursing artists turn out to be rappers, as expected.
curse_words <- c("fuck", "fag", "dick", "tits", "pussy", "ho", "hoe", "hoes", "ass", "n-word", "shit", "cock", "bitch", "cunt", "niger", "nigger", "niggers")
curse_words_count_word <- lyrics_per_artist %>% subset(word %in% curse_words) 
curse_words_count <- curse_words_count_word[c("artist","n")]
curse_words_count <-  aggregate(n ~ artist, data = curse_words_count, sum)
curse_words_count <- curse_words_count[order(-curse_words_count$n),]
print(curse_words_count %>% top_n(100))

As expected, the top cursing artists are almost all rappers, but one artist looked out of place.
Lata Mangeshkar is in the top list. She is an Indian singer who was popular in the 1970s, and we would not expect her to curse that much.
To see whether this is truly the case, we printed the curse words she used

In [None]:
lata_mangeshkar_words <-  curse_words_count_word %>% subset(artist %in% c("Lata Mangeshkar")) 
head(lata_mangeshkar_words)

We can clearly see that the only matching word is "ho", which on its own was used 64 times. This looked odd, so we investigated further: in Hindi, "ho" is a form of the verb "to be", which explains why she used it so much. This count is therefore a false positive rather than actual cursing. It is the only such outlier we found in the top 100.
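
One way such false positives could be flagged automatically is to check whether a single word accounts for an artist's entire count. This is only a sketch on made-up numbers (apart from the 64 quoted above); `toy_counts` and the "Rapper X" rows are hypothetical.

```r
# Hypothetical per-artist curse-word counts
toy_counts <- data.frame(
  artist = c("Rapper X", "Rapper X", "Rapper X", "Lata Mangeshkar"),
  word   = c("shit", "ass", "ho", "ho"),
  n      = c(40, 25, 10, 64),
  stringsAsFactors = FALSE
)

# Total count and count of the single most frequent word, per artist
total   <- tapply(toy_counts$n, toy_counts$artist, sum)
top_one <- tapply(toy_counts$n, toy_counts$artist, max)

# Share of the total explained by one word; 1 means a single word accounts for everything
dominance  <- top_one / total
suspicious <- names(dominance)[dominance == 1]

print(round(dominance, 2))
print(suspicious)
```

An artist flagged this way is not necessarily a false positive, but is a reasonable candidate for the kind of manual check described above.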