-
Notifications
You must be signed in to change notification settings - Fork 0
/
gcsendes_assignment.Rmd
336 lines (257 loc) · 12.7 KB
/
gcsendes_assignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: "DV2 assignment"
author: "Gerold Csendes"
date: '2020 02 12 '
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libraries, echo=FALSE}
library(data.table)
library(ggplot2)
theme_set(theme_bw()) # set global theme for ggplot
```
```{r read, echo=TRUE, message=FALSE, warning=FALSE}
# Read data
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
```
```{r skim}
# Check data
dim(spotify_songs)
# Let's convert into data.table for efficiency reasons
spotify_songs <- as.data.table(spotify_songs)
# Skim the data to get some sense of variable attributes
id_cols <- c('track_id', 'track_album_id', 'playlist_id')
skimr::skim(spotify_songs[, setdiff(colnames(spotify_songs), id_cols)])
```
I think it was definitely worth skimming the data, since it provide a nice overview of the data. With that in hand, we are able to phrase relevant/interesting questions about the spotify dataset. I am going to classify these questions into two groups:
### 1. Data quality:
+ What is the **granularity** of the data? I would have first guessed that it is a track but you can
see in the skim output that it doesn't equal the number of rows. We better look into this!
+ There are some extremely long track/album/playlist names. Why is that?
+ There are a very little missing data but I am going to check whether they can be imputed or just dropped.
+ The distribution of the numeric variables seem oaky at first glance, so I will leave them as they are for the time being.
### 2. Data exploration: what insights can we infer from the data? I am rather interested in two main topics.
+ How are genres and subgenres related to each other?
+ What are the most popular songs/albums?
+ Can we cluster the songs so they will be belong to the same genre? In other words, can we validate the way we group data by genre by data?
+ I wanna see how the number of tracks in each genre increased over time.
Enough is said, let's get to work!
### Data quality
```{r data_quality}
# First check missing data
spotify_songs[is.na(track_name) ]
```
It seems like we have all the data except for attributes of the artist. This isn't really surprise because spotify_songs data.table seems to come from a transactional database and thus only the manually adminstered data such as artist name etc.. can be missing. Let's drop these observations
```{r data_quality2}
spotify_songs <- na.omit(spotify_songs)
data_dim <- dim(spotify_songs)
```
To check upon the granularity of the dataset, look for songs that are present more than once
```{r granularity}
# First check whether the rows are unique but I suspect it isn't the cause
dim(unique(spotify_songs)) # not it isnt't
# check song counts
dups <- spotify_songs[, .(song_cnt = .N), by = track_id][song_cnt > 1][order(-song_cnt)]
dups
dups_sum <- sum(dups$song_cnt)
# check the one with the most presence
spotify_songs[track_id == '7BKLCZ1jbUBVqRi2FVlTVw'][
,c('track_name', 'track_artist', 'track_album_name', 'playlist_name', 'playlist_id')]
```
It seems like that a track being present in more playlists is causing this. If this is the case, then track_id and playlist_id together should be the granularity of the data. Let's prove this:
```{r granularity2}
# which equals
dim(dups)[1] - dups_sum
```
This equals number of records - distinct track ids. This was important to check because we can be quite sure that not the song changed but it was just included in more playlists. We will need to keep this in mind when analyzing the song attributes to not to include the same song multiple times!
Next up is to inspect album name lengths.
```{r length}
# create var
spotify_songs[, album_name_length := nchar(track_album_name)]
# viz length distribution
ggplot(spotify_songs, aes(album_name_length)) +
geom_histogram(colour="darkblue", fill="lightblue") +
geom_vline(aes(xintercept=mean(album_name_length)),
color="purple", linetype="dashed", size=1) +
labs(title = "ALbum name char. length distribution") +
theme(plot.title = element_text(hjust = 0.5))
# check what can you make out of 50 chars
spotify_songs[album_name_length == 50][1:2,track_album_name]
# check what can you make out of 70 chars
spotify_songs[album_name_length == 70][1:2, track_album_name]
# check what can you make out of 50 chars
spotify_songs[album_name_length == 90][1:2, track_album_name]
# check what can you make out of 50 chars
spotify_songs[album_name_length == 110][, track_album_name]
```
Well, these seem to reasonable names. But what about playlists? They should have rather short and not too specific names. Let's do the same for playlists.
```{r length2}
# create var
spotify_songs[, playlist_name_length := nchar(playlist_name)]
# viz length distribution
ggplot(spotify_songs, aes(playlist_name_length)) +
geom_histogram(colour="darkblue", fill="lightblue") +
geom_vline(aes(xintercept=mean(album_name_length)),
color="purple", linetype="dashed", size=1) +
labs(title = "Playlist name char. number distribution") +
theme(plot.title = element_text(hjust = 0.5))
# check what can you make out of 50 chars
spotify_songs[playlist_name_length > 50][1, playlist_name]
# check what can you make out of 70 chars
spotify_songs[playlist_name_length > 100][1, unique(playlist_name)]
```
`U0001f47d` are emoji encodings, so if we want to really analyze playlist name lengths, then we should filter these out. It could be done by R's base `enc2native()` but I will just leave it this way because I just wanted to see name with long lenghts.
### EDA
Check how subgenres relate to genres
```{r genres}
spotify_songs[, unique(playlist_subgenre), by=playlist_genre]
```
From now on, the song attributes such as tempo, loudness etc.. will be analyzed. Since songs are duplicated, let's filter for unique songs.
```{r filter_to_first}
songs <- unique(spotify_songs, by = 'track_id')
dim(songs)
```
Which artists/albums are the most popular? Filter for more than 20 tracks.
```{r popularity}
# plot an artist ranking
# horizontal
ggplot(songs[, .(avg_popularity = mean(track_popularity),
pop_sd = sd(track_popularity),
track_count = .N
) , by = track_artist][track_count > 20][order(-avg_popularity)][1:10,]
) +
geom_bar( aes(x=reorder(track_artist, avg_popularity), y=avg_popularity), stat="identity", fill="skyblue", alpha=0.5) +
geom_errorbar( aes(x=track_artist,
ymin=avg_popularity-pop_sd,
ymax=avg_popularity+pop_sd), width=0.4, colour="orange", alpha=0.9, size=1.3) +
coord_flip() +
labs(title = 'Top 10 artists', x = 'artists')
```
The sd range is big due to the small number of observations. Do the same for albums. Here I think gotta be more cautios because I suspect album names to be not unique for different aritsts.
```{r }
# get best albums by popularity only consider albums w. at least 5 tracks
fav_albums <- songs[, .(avg_popularity = mean(track_popularity),
pop_sd = sd(track_popularity),
track_count = .N
) , by = track_album_id][order(-avg_popularity)][track_count > 5][1:10,]
# merge to get artist names
fav_albums <- merge(fav_albums,
unique(songs, by = c('track_album_id', 'track_album_name', 'track_artist'))
[, c('track_album_id', 'track_album_name', 'track_artist')],
all.x =T, all.y = F, by = 'track_album_id')
# create var to display album name + performer
fav_albums[, artist_album:= do.call(paste, c(.SD, sep = " - ")),
.SDcols=c('track_artist', 'track_album_name')]
ggplot(fav_albums) +
geom_bar( aes(x=reorder(artist_album, avg_popularity), y=avg_popularity), stat="identity",
fill="skyblue", alpha=0.5) +
coord_flip() +
labs(title = 'Top 10 artists', x = 'artists')
```
I am really interested whether the playlist genres really group the alike songs. I consider alike songs that have their attributes like energy and loudness in a similar range. To do this, let's apply PCA and cluster the songs and then check whether the clusters capture the genres.
```{r clusters}
library(caret)
library(NbClust)
library(factoextra)
pca_vars <- colnames(songs)
# extract numeric vars
pca_vars <- pca_vars[!(startsWith(pca_vars, 'track') | startsWith(pca_vars, 'playlist') |
startsWith(pca_vars, 'album'))]
pca_result <- prcomp(songs[, pca_vars, with=F], scale. = T)
(pca_result$rotation)[, seq(1:5)]
fviz_pca(pca_result, label = F, alpha = 0.2)
```
Well there is a lot of overlap. Lets color the songs (dots) by genre, we might explore patterns
```{r clsuters2}
# create a data table from the pca result
pca_songs <- as.data.table(pca_result$x)
# assign rows as ids
pca_songs[, id:=row.names(pca_songs)]
# do the same for the `original` dataset
songs[, id:=row.names(songs)]
ggplot(merge(pca_songs, songs[, c('id', 'playlist_genre')],
all.x = T, all.y = T, by = 'id'),
aes(PC1, PC2, color = playlist_genre)) +
geom_point(alpha = 0.2)
```
Well, there is a lot of data points and a lot of overlap. If I may try hard enough, I may find some patterns between the genres but just by looking at this graph, it is nearly impossible. We would be better of if we analyzed this problem analytically, for example clustering in this PCA space and calculate for example dissimilarity scores between the clusters. But that would be out of the scope of this assignment.
Let's make racing bars based on how many songs were made per year per subgenre.
This page was really helpful when doings this: [link](https://www.r-bloggers.com/how-to-create-a-bar-chart-race-in-r-mapping-united-states-city-population-1790-2010/)
```{r racing}
# create a Year col
songs[, Y := format(as.Date(track_album_release_date), "%Y") ]
# select relevant cols
data_racing <- songs[, c('Y', 'playlist_subgenre', 'track_id')]
data_racing <- na.omit(data_racing) # needed for the ggplot
data_racing <- data_racing[, .(cnt = .N), by = c('Y', 'playlist_subgenre' ) ]
setorderv(data_racing, c('Y', 'cnt'))
# cumulative count
data_racing[, cnt_cum := cumsum(cnt), by = c( 'playlist_subgenre')]
# not necessary col
data_racing[,cnt:=NULL]
data_racing[order(c('YM', 'playlist_subgenre'))]
# check if this works as expected
data_racing[playlist_subgenre == 'album rock'][1:8]
```
I really prefer pivoting data.frames and not data.tables
```{r libraries_again, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyr)
library(tidyverse)
```
```{r racing2}
# only to get the 'full' range of year for example there are years for which there was no tracks released in a genre
data_racing <- pivot_wider(data_racing, names_from = Y, values_from = cnt_cum )
data_racing <- pivot_longer(data_racing, -playlist_subgenre, names_to = "Y", values_to = "album_cumsum")
# forward fill values. If there was no track in 1970 in a genre then it will be imputed with the last value
data_racing <- data_racing %>%
group_by(playlist_subgenre) %>%
fill(album_cumsum)
# shape for ggplot
data_racing <- data_racing %>%
group_by(Y) %>%
arrange(-album_cumsum) %>%
mutate(rank = row_number()) %>%
filter( rank <= 10) %>%
arrange(Y)
```
```{r}
library(gganimate)
gen_subgenre <- unique(songs[, c('playlist_genre', 'playlist_subgenre')])
data_racing <- merge(data_racing, gen_subgenre, by = 'playlist_subgenre', all.x = T, all.y = F)
data_racing$Y <- as.numeric(data_racing$Y)
# write_rds(racing_rds, 'racing.rds')
p <- data_racing %>%
ggplot(aes(x=-rank, y=album_cumsum, fill = playlist_genre)) +
geom_tile(aes(y = album_cumsum/2, height=album_cumsum),width=0.9)+
geom_text(aes(label=playlist_subgenre),
hjust="right",
colour="black",
fontface="bold",
nudge_y=-100)+
geom_text(aes(label=scales::comma(album_cumsum)),
hjust="left",
nudge_y=50,
colour="grey30")+
theme_minimal() +
coord_flip(clip="off") +
scale_x_discrete("") +
scale_y_continuous("",labels=scales::comma)+
scale_fill_brewer(palette = 'RdYlBu') +
theme(panel.grid.major.y=element_blank(),
panel.grid.minor.x=element_blank(),
plot.title= element_text(size=20,colour="grey50",face="bold"),
plot.caption = element_text(colour="grey50"),
plot.subtitle = element_text(size=20,colour="grey50",face="bold"),
plot.margin = margin(1,1,1,2,"cm"),
axis.text.y=element_blank())+
transition_time(Y) +
labs(title='Number of tracks released in {round(frame_time,0)}',
caption='made by Gerold')
# I couldn't make this work so in an rmdown so I animated and saved the gif in an R script
# rendering a video would help solving this but would have taken too much time to deal with this
#animate(p, duration = 30, fps = 25, end_pause = 20)
# anim_save("racing-bars.gif")
```
![](racing-bars_noloop.gif)