In [None]:
## Importing packages

# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

library(tidyverse) # metapackage with lots of helpful functions

## Running code

# In a notebook, you can run a single code cell by clicking in the cell and then hitting 
# the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script, 
# you can run code by highlighting the code you want to run and then clicking the blue arrow
# at the bottom of this window.

## Reading in files

# You can access files from datasets you've added to this kernel in the "../input/" directory.
# You can see the files added to this kernel by running the code below. 

list.files(path = "../input")

## Saving data

# If you save any files or images, these will be put in the "output" directory. You 
# can see the output directory by committing and running your kernel (using the 
# Commit & Run button) and then checking out the compiled version of your kernel.

In [None]:
library(gridExtra)
library(ggcorrplot)
library(dplyr)
library(RColorBrewer)
library(MASS)
library(gvlma)

In [None]:
my_list <- read.csv("../input/top-spotify-tracks-of-2018/top2018.csv")

In [None]:
head(my_list)

In [None]:
names(my_list)

In [None]:
summary(my_list)

In [None]:
colSums(is.na(my_list))

In [None]:
corr <- round(cor(my_list[,4:16]),8)
ggcorrplot(corr)

**The plot shows that loudness and energy are more strongly correlated with each other than the other pairs of variables.**
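
As a complement to reading the plot by eye, the strongest pair can also be pulled out of a correlation matrix programmatically. The sketch below uses made-up toy data (the `toy` data frame and its values are assumptions, not the Spotify data), so only the technique carries over.

```r
# Toy data standing in for the audio features (values are made up)
set.seed(1)
loudness <- rnorm(50)
toy <- data.frame(
  loudness     = loudness,
  energy       = 0.8 * loudness + rnorm(50, sd = 0.3),  # built to track loudness
  danceability = rnorm(50)
)

corr_toy <- cor(toy)

# Blank out the diagonal and the duplicated lower triangle,
# then locate the pair with the largest absolute correlation
corr_upper <- corr_toy
corr_upper[lower.tri(corr_upper, diag = TRUE)] <- NA
idx <- which(abs(corr_upper) == max(abs(corr_upper), na.rm = TRUE), arr.ind = TRUE)

top_pair <- c(rownames(corr_upper)[idx[1, "row"]],
              colnames(corr_upper)[idx[1, "col"]])
top_pair
```

The same steps should work on `corr` from the cell above, since it is an ordinary numeric matrix with row and column names.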

In [None]:
dim(my_list)

In [None]:
sapply(my_list, class)

In [None]:
unique(my_list$key)
unique(my_list$mode)
unique(my_list$time_signature)

In [None]:
my_list$key <- as.factor(my_list$key)
my_list$mode <- as.factor(my_list$mode)
my_list$time_signature <- as.factor(my_list$time_signature)

In [None]:
sapply(my_list, class)

In [None]:
#converting the numerical keys (0-11) to the actual musical keys
#mapping by key value instead of level position keeps the labels correct
#even if some key never occurs in the data
key_names <- c("C","C#","D","D#","E","F","F#","G","G#","A","A#","B")
my_list$key <- factor(key_names[as.integer(as.character(my_list$key)) + 1],
                      levels = key_names)

In [None]:
#convert the duration from milliseconds to minutes (the column keeps the name duration_ms)
my_list$duration_ms <- my_list$duration_ms/60000

In [None]:
#adding a popularity column to my_list: the position of each track in the list
#(1 = first row, so a lower value means a higher rank)
popularity <- c(1:100)
my_list <- cbind(my_list,popularity)

In [None]:
my_list_1 <- my_list

In [None]:
#bin valence into three mood classes with cut(); assigning the labels one
#range at a time would turn the column into text after the first assignment
#and make the later numeric comparisons unreliable
my_list$valence <- cut(my_list$valence,
                       breaks = c(0, 0.350, 0.700, 1),
                       labels = c("sad", "happy", "Euphoric"),
                       include.lowest = TRUE)



In [None]:
my_list$valence <- as.factor(my_list$valence)

In [None]:
 ggplot(my_list) + geom_density(aes(energy),fill="steelblue")

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

The density is highest at energy levels above 0.6, i.e. most of these songs are quite intense.

In [None]:
ggplot(my_list) + geom_bar(aes(valence),fill="steelblue")

Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. We grouped it into three classes:

**valence up to 0.350 as sad,
valence from 0.351 to 0.700 as happy,
valence from 0.701 to 1.000 as Euphoric**.

Tracks with high valence sound more positive.
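
The class boundaries can be sketched with base R's `cut()`. The scores below are hypothetical and chosen to sit on or near the boundaries; note that `cut()` uses right-closed intervals, so a value of exactly 0.35 falls in "sad" and exactly 0.70 in "happy".

```r
# Hypothetical valence scores, including values on the class boundaries
valence_scores <- c(0.05, 0.35, 0.36, 0.70, 0.71, 0.98)

mood <- cut(valence_scores,
            breaks = c(0, 0.35, 0.70, 1),   # (0, 0.35], (0.35, 0.70], (0.70, 1]
            labels = c("sad", "happy", "Euphoric"),
            include.lowest = TRUE)          # also keeps a valence of exactly 0

# Number of tracks per class
table(mood)
```

Binning in a single call keeps the column numeric until all classes are assigned, and the resulting factor has its levels in the order sad, happy, Euphoric.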

In [None]:
ggplot(my_list) + geom_bar(aes(time_signature),fill="steelblue")

**Time Signature Analysis** The time signature (meter) is a notational convention that specifies how many beats are in each bar (or measure). The top 100 songs predominantly have a time signature of 4.


In [None]:
ggplot(my_list) + geom_density(aes(duration_ms),fill="steelblue")

**Duration Analysis** The density peaks between 3 and 4 minutes, so most of the songs are 3 to 4 minutes long.

In [None]:
ggplot(my_list) + geom_density(aes(danceability),fill="steelblue")

* **Danceability Analysis:**
Danceability describes how suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable. The distribution leans towards the upper end of the scale, i.e. towards 1.

In [None]:
ggplot(my_list) + geom_bar(aes(mode),width = 0.4,fill="steelblue")

In [None]:
ggplot(my_list) + geom_density(aes(speechiness),fill="steelblue")

* **Speechiness Analysis:**
A speechiness above 0.66 indicates a track that is probably made entirely of spoken words, a score between 0.33 and 0.66 indicates a track that may contain both music and speech, and a score below 0.33 most likely indicates music and other non-speech-like content. Most of the density here lies below 0.33, so the top 100 songs are mostly music rather than spoken-word tracks.
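
As a rough numeric check of the 0.33 threshold, the share of tracks below it can be computed directly. The scores here are hypothetical placeholders, not values from the dataset.

```r
# Hypothetical speechiness scores (not taken from the dataset)
speechiness_scores <- c(0.03, 0.05, 0.12, 0.28, 0.40, 0.70)

# A logical vector averages to the proportion of TRUE values
share_low <- mean(speechiness_scores < 0.33)
share_low
```

Applied to `my_list$speechiness`, the same expression would quantify what the density plot suggests visually.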

In [None]:
ggplot(my_list) + geom_density(aes(acousticness),fill="steelblue")

In [None]:
ggplot(my_list) + geom_bar(aes(key),width = 0.4,fill="steelblue")

* **Keys Analysis**
We mapped the numeric key codes to standard musical key names to analyse which key is the most popular.
The key C# has the highest number of occurrences.
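
A minimal sketch of the key mapping and count, assuming the usual pitch-class encoding (0 = C, 1 = C#, ..., 11 = B). The `key_codes` vector is hypothetical and deliberately omits some keys, to show that mapping by value stays correct when a key is absent.

```r
# Pitch-class names in the order of the integer key encoding (0 = C, ..., 11 = B)
key_names <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# Hypothetical key codes; not every code from 0 to 11 occurs
key_codes <- c(1, 1, 0, 7, 11, 1, 7)

# Map each code to its name by value, not by position among the observed levels
keys <- factor(key_codes, levels = 0:11, labels = key_names)

# Count occurrences and pick the most frequent key
key_counts <- table(keys)
most_common <- names(key_counts)[which.max(key_counts)]
most_common
```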

In [None]:
#count the number of songs per artist, most frequent first
my_frequency <- my_list %>%
  group_by(artists) %>%
  summarise(no_rows = n()) %>%
  arrange(desc(no_rows)) %>%
  as.data.frame()


In [None]:
my_frequency_10 <- my_frequency[1:10,]

In [None]:
ggplot(my_frequency_10,aes(artists,no_rows))+
geom_bar(stat = "identity",width=0.4,fill="steelblue")+
labs(x="Top 10 Artists")+
labs(y="Count of Songs in top 100 List")+
labs(title = "Top 10 Artist Counts")

* **Top 10 Artists Analysis:** This plot shows the 10 artists that appear most often in the top 100.

Post Malone and XXXTENTACION, with 6 songs each, have the most songs in the top 100 list, i.e. these two are the most popular artists.

In [None]:
ggplot(my_list,aes(popularity,acousticness)) + 
  geom_point(stat="identity")+
  geom_abline(intercept = 0.65,slope=0)+
  labs(x="Popularity")+
  labs(y="acousticness")+
  labs(title = "Popularity vs acousticness")+
  coord_polar()

* **Acousticness vs Popularity:**
The lower the acousticness, the more electric sound a track contains.
The values are concentrated towards the centre of the plot (acousticness of roughly 0.35 or lower), so most of the songs include electric sounds.

In [None]:
ggplot(my_list,aes(popularity,instrumentalness)) + 
  geom_point(stat="identity")+
  labs(x="Popularity")+
  labs(y="instrumentalness")+
  labs(title = "Popularity vs instrumentalness")

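* **Instrumentalness vs Popularity:**
Instrumentalness is below 0.01 for almost all of the songs.
Instrumentalness estimates how likely a track is to contain no vocals, so values this low indicate that these songs contain vocals rather than being purely instrumental.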

**Try to predict one audio feature based on the others**

We further tried predicting the variable *Energy* using multiple regression.
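
For orientation, the overall workflow (split, fit, predict on held-out rows) can be sketched on R's built-in `mtcars` data, which stands in for the Spotify data here. This sketch uses a fixed-size 80/20 split and an arbitrary seed, both of which are assumptions for illustration and differ from the per-row random assignment used in the cells below.

```r
set.seed(42)  # arbitrary seed so that the split is reproducible

# mtcars ships with R and stands in for the Spotify data
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of the rows for training

train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Multiple regression: one response, several predictors
fit <- lm(mpg ~ wt + hp, data = train)

# Predictions for rows the model has not seen during fitting
pred <- predict(fit, newdata = test)
head(data.frame(observed = test$mpg, predicted = pred))
```

A fixed-size split guarantees the same number of test rows on every run, which can matter with only 100 observations.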

In [None]:
#Splitting data into train (about 80%) and test (about 20%) sets.
set.seed(123) #arbitrary seed so that the random split is reproducible

sample_data <- sample(2,nrow(my_list),replace=TRUE,prob = c(0.8,0.2))

train_data <- my_list[sample_data == 1,]
test_data <- my_list[sample_data == 2,]

In [None]:
#Model creation
fit_linear <- lm(energy ~ loudness+danceability+valence+speechiness+acousticness+instrumentalness+liveness , data=train_data)

In [None]:
gvlma(fit_linear)

In [None]:
#feature selection - Backward elimination

step_1 <- stepAIC(fit_linear,direction = "backward")
step_1$anova  

In [None]:
#Feature Selection - Both
step_2 <- stepAIC(fit_linear,direction = "both")
step_2$anova 

In [None]:
#Final Model:
fit_final <- lm(energy ~ loudness + danceability +acousticness, data = train_data)


In [None]:
gvlma(fit_final)

In [None]:
summary(fit_final) 

In [None]:
predicted <- predict(fit_final,newdata = test_data)

In [None]:
observed <- test_data$energy

In [None]:
predicted
observed

In [None]:
SSE <- sum((observed - predicted) ^ 2)
SST <- sum((observed - mean(observed)) ^ 2)
r2 <- 1 - SSE/SST
rmse <- sqrt(mean((predicted - observed)^2))

In [None]:
r2

**With the above model we obtained a considerable R² value on the test data when predicting energy from loudness, danceability and acousticness.**
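
The two metrics computed above can be wrapped in small helper functions. `calc_r2` and `calc_rmse` are hypothetical names introduced for this sketch, and the observed and predicted values are made up purely to exercise them.

```r
# Test-set R^2: 1 - SSE/SST, with SST taken around the mean of the observed values
calc_r2 <- function(observed, predicted) {
  sse <- sum((observed - predicted)^2)
  sst <- sum((observed - mean(observed))^2)
  1 - sse / sst
}

# Root mean squared error, in the same units as the response
calc_rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical energy values, only to exercise the functions
obs  <- c(0.50, 0.60, 0.70, 0.80)
pred <- c(0.55, 0.58, 0.72, 0.75)

calc_r2(obs, pred)
calc_rmse(obs, pred)
```

Unlike the R² reported by `summary()` for the training data, this out-of-sample version can be negative when the model predicts worse than the mean of the observed values.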

**Conclusion**:

**Look for patterns in the audio features of the songs. Why do people stream these songs the most?**<br>
1. Energetic songs.<br>
2. Song duration between 3 and 4 minutes.<br>
3. High danceability.<br>
4. Low speechiness (little spoken-word content) and low instrumentalness (the songs contain vocals).<br>
5. Low acousticness (more inclusion of electric sounds).<br>

**Try to predict one audio feature based on the others**<br>
With the above model we obtained a considerable R² value on the test data when predicting energy from loudness, danceability and acousticness.

**See which features correlate the most**<br>
Energy and loudness correlate the most.