-
Notifications
You must be signed in to change notification settings - Fork 0
/
epica-post.qmd
733 lines (584 loc) · 38.6 KB
/
epica-post.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
---
title: "Twenty years of Epica: The Empi(R)ics"
author: "Daniel Sánchez"
date: "2023-09-15"
description: "As a (late) tribute to Epica's *The Phantom Agony* twentieth birthday, I analyse Epica's empirics by accessing the Genius, Last.fm and Spotify APIs with R to analyse the Dutch band's musical features, streams and lyrics."
categories: [english, music, text analysis, data viz, epica, metal]
csl: ../../econometrics.csl
image: wordcloud.png
knitr:
opts_chunk:
message: false
warning: false
echo: true
include: true
bibliography: references.bib
format:
html:
code-fold: true
code-tools: true
code-summary: "Show the code"
code-overflow: wrap
---
```{=html}
<iframe src="wordcloud_phantom_agony.html" width="96%" height="500"></iframe>
```
*A wordcloud of Epica's most common lyrics. Made with the wordcloud2 package in R.*
```{r}
#| label: setup
#| echo: false
# Load libraries
library(dplyr)
library(ggplot2)
library(lastfmR) # Use remotes::install_github("ppatrzyk/lastfmR") to install the lastfmR package
library(spotifyr)
library(fuzzyjoin)
library(tidyr)
library(lubridate)
library(fixest)
library(broom )
library(geniusr)
library(tidytext)
library(wordcloud)
library(stringr)
library(here)
library(ggthemes)
library(ggdark)
library(showtext)
library(fedmatch)
library(ggimage)
library(ggtext)
library(wordcloud2)
# Add aesthetics
font_add_google("Questrial", family = "Questrial")
showtext_auto()
theme_daniel<-
theme_hc(style = 'darkunica',
base_family = 'Questrial',
base_size = 40) +
theme(axis.line.y = element_line(colour = 'white'),
axis.line.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
plot.caption = element_text(hjust = 0, face = 'italic'),
plot.title.position = 'plot',
plot.caption.position = 'plot',
plot.subtitle = element_text(lineheight = 0.5, vjust = 0.5),
axis.ticks = element_blank(),
plot.title = element_text(size = 60),
axis.text = element_text(colour = 'gray'))
# Execute my credentials (unstaged in the repository for privacy reasons)
source('credentials.R')
```
> *Epica could record an album of Karol G covers and no one would bat an eye* - [Last Rites](https://yourlastrites.com/2022/11/17/epica-the-alchemy-project-review/#:~:text=Though%20Epica%20has%20since%20normalized,given%20the%20diversity%20of%20its), 2022
On June 5th, 2003, dutch-based symphonic metal band [Epica](https://www.epica.nl/home) released their first full-length album, [*The Phantom Agony*](https://open.spotify.com/album/5qBmY4zyWEYP8bNJsq9Xjf?si=fJdSpTbvT4STnJJf3FweTw), the first of eight albums on the band's successful trajectory in the metal scene. Twenty years after, Epica is alive and kicking after having released a full-length album in 2021, [*Omega*](https://open.spotify.com/album/4Hb5cbP4Z9H9G1HavAra4q?si=9oHGzf-CRferS5kk9cxbsA), an EP in 2022, [*The Alchemy Project*](https://open.spotify.com/album/5So2cOEFCkkxZ7o5DAvpPH?si=5NCuejwaT3CIfHz1qzAG-w), and were headliners of Metallica's M72 2023 World Tour concerts in Europe. Epica's lengthy trajectory makeS the band a great case study about how death/symphonic metal has adapted to the modern, stream-based music industry. In this post, yours truly uses API data from Genius, Last.fm and Spotify to look at the empirics behind Epica's musical trajectory, with an emphasis on their first album *The Phantom Agony*.
## What makes Epica different?
Epica is *unique*. Though that is something that any die-hard fan will tell you, Epica has several factors which make them stand out in the genre. Most importantly, they combine gutural growls from death metal with female soprano vocals and string orchestras. Epica started this particular style on *The Phantom Agony*, after Dutch guitarist and singer Mark Jansen left After Forever and founded the band with Dutch singer Simone Simons.
To look at how Epica has constructed its musical features, we can use Spotify-computed audio features, which rank songs based on energy, danceability, valence, among others [@duman2022]. The Spotify API can be accessed with R through the [*spotifyr*](https://www.rcharlie.com/spotifyr/) package, but it needs users to create a [Spotify Developer account](https://developer.spotify.com/) and request a Client ID and a Client Secret. For a comprehensive tutorial on how to do this, go [here](https://msmith7161.github.io/what-is-speechiness/).
With the credentials, set them up as system environment variables through a call to `Sys.setenv(SPOTIFY_CLIENT_ID = 'something')` and `Sys.setenv(SPOTIFY_CLIENT_SECRET= 'something')`. Further, since we're going to access data from Epica, it's useful to set up the Spotify identificator for the band, which can be found in the Spotify [artist's link](https://open.spotify.com/artist/5HA5aLY3jJV7eimXWkRBBp?si=7BH2xg27Q9-gjo7UC25-FA), after the `artist/` and before the interrogation sign `?`. With that, we can acquire all albums from Epica available in Spotify, as well as audio metrics per track. The data requires quite a bit of preparation for *ggplot2* graphics, but trust me, it's worth it. Below, I acquire track-level audio metrics and then group at the album level[^1] by taking averages of all metrics.
[^1]: I only consider Epica's eight full-length releases: *The Phantom Agony (2003)*, *Consign To Oblivion (2005)*, *The Divine Conspiracy (2007)*, *Design Your Universe (2009)*, *Requiem for the Indifferent (2012)*, *The Quantum Enigma (2014)*, *The Holographic Principle (2016)*, *Omega (2021)* and their two EPs, *The Solace System (2017)* and *The Alchemy Project (2022)*. I leave out the *Attack on Titan* cover album (which does not make *Crimson Bow and Arrow* any less epic), singles and live albums for a more centered analysis.
::: callout-tip
You can display the code which generated the graphs using the little down arrow in the code blocks.
:::
```{r}
#| label: albums-tracks-spotify
#| fig-width: 12
#| fig-height: 8
# Setup the access token
access_token <- get_spotify_access_token(client_id = Sys.getenv('SPOTIFY_CLIENT_ID'),
client_secret = Sys.getenv('SPOTIFY_CLIENT_SECRET'))
# Define an album/EPs vector, a year vector, to use later as well as the artist id
epica_spotify_id <- '5HA5aLY3jJV7eimXWkRBBp'
epica_releases_year <-
data.frame(
release_spotify = c("The Phantom Agony (Expanded Edition)",
"Consign To Oblivion (Expanded Edition)",
"The Divine Conspiracy",
"Design Your Universe",
"Requiem for the Indifferent",
"The Quantum Enigma",
"The Holographic Principle",
"The Solace System",
"Omega",
"The Alchemy Project"),
release_name = c("The Phantom Agony",
"Consign To Oblivion",
"The Divine Conspiracy",
"Design Your Universe",
"Requiem for the Indifferent",
"The Quantum Enigma",
"The Holographic Principle",
"The Solace System",
"Omega",
"The Alchemy Project"),
release_year = c(2003, 2005, 2007, 2009, 2012, 2014, 2016, 2017, 2021, 2022)
) %>%
mutate(release_name_year = paste(release_name,' ', '(', release_year, ')', sep = ''))
# Get metrics at the track level and filter for the selection of releases I chose
epica_track_metrics <-
get_artist_audio_features(epica_spotify_id, include_groups = c('album', 'single')) %>%
select(track_name,
album_name,
album_id,
energy,
loudness,
speechiness,
acousticness,
instrumentalness,
liveness,
valence,
tempo) %>%
filter(album_name %in% epica_releases_year$release_spotify)
# Group at the album level, reshape to long form, then plot
epica_album_metrics <-
epica_track_metrics %>%
group_by(album_name) %>%
summarise_if(is.numeric, mean) %>%
left_join(epica_releases_year, by = c('album_name' = 'release_spotify')) %>%
arrange(desc(release_year)) %>%
select(-album_name) %>%
mutate(loudness = abs(loudness)) %>%
relocate(release_name, release_name_year, release_year) %>%
gather(metric, value, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo)
# Create an ordering for my metrics
orderings <-
data.frame(
metric = c('energy', 'valence', 'instrumentalness', 'speechiness'),
order = c(4,3,2,1)
)
# Graph
epica_album_metrics %>%
filter(metric %in% c('energy', 'valence', 'instrumentalness', 'speechiness')) %>%
left_join(orderings, by = 'metric') %>%
mutate(metric = factor(str_to_sentence(metric))) %>%
ggplot(aes(reorder(release_name_year, release_year), value, fill = reorder(metric, order))) +
geom_col(width = 0.7, position = position_dodge(width = 0.7)) +
coord_flip() +
labs(title = 'The Sound of Epica 2000-2023',
subtitle = "Spotify audio metrics for all Epica albums & EPs \n",
y = '',
x = '',
caption = str_wrap('Note: Spotify provides audio metrics at the track level. The horizontal axis shows the average value of the metric, grouping by band release. Data obtained from the Spotify API through the spotifyr package.', 120),
fill = 'Metric') +
theme_daniel +
theme(axis.text.x = element_blank(),
axis.text.y = element_text(colour = 'gray'),
legend.position = c(0.65, 1.10),
legend.margin = margin(0),
legend.direction = 'horizontal') +
guides(fill = guide_legend(nrow = 2))
```
Epica's characteristic sound, which meddles the harmonious verses by the female lead with violent male growls became known to the world in a single, *Cry for the Moon*, released 2003 and included in their first album. *Cry for the Moon* remains as one of Epica's most important tracks, as it set the scene for the what was to come later. Curiously, this first album is not the most instrumental of them all, ranking fourth in terms of instrumentalness according to Spotify. The most instrumental release by Epica is actually *Consign to Oblivion* (2005), often considered as the band's *magnum opus*, followed by 2016's *The Holographic Principle*. The data show an apparent trade-off between energy and instrumentality, an intuitive idea which is only challenged by the unique *Design Your Universe* (2009) with high instrumentality and energy. This last release contained the unforgettable *Unleashed*, probably Epica's most mainstream track to date.
Somewhat of a pattern can be found in the band's last two releases: the instrumentality fades, apparently in favour of more energy. Is this the way that Epica makes its way into today's hyper competitive and commercial music industry? A question like this has no clear answer (and perhaps does not deserve one), but the strategy here might be one of making a subtle impact. With newer releases which are *friendler* to the unexperienced listeners, Epica subtly enters the scene to the create an *obsessive devotion* after the innocent streamer turns to mindless fanatic after listening to the band's more loyal records of old.
# An attempt at stream analysis with Last.fm scrobbles
What comes to one's mind when reading about music data analysis is that stream data is key to understand listening patterns. With today's stream-based music industry, data on music preferences must be infinitely better than whatever it used to be some decades ago. However, two problems arise with stream data. The first and most mundane is that *stream data is not easily accesible*, probably due to the proprietary nature of the platforms that collect the data. The second and more abstract issue with stream analysis is that, even though I am not a professional musician myself, I am pretty sure that no serious artist writes their music based on data (or at least they shouldn't).
My answer is *because we can*, or at least because there's an attempt to be made, and it aids the non-professional musician's understanding of the world of music. And while you won't see Mark Jansen trying to deploy a Shiny app, you *will* see him sitting down reading scientific literature for lyrics [@metal-exposure2021]. Epica's inspiration on scientific knowledge for their lyrics more or less justifies a scientific approach when trying to understand their music.
Having that in mind, in this section I present an attempt at stream analysis with publicly available stream data from Last.fm. Last.fm is an online platform which tracks data from users through scrobbling, a way to "scrape" the data from music platforms and native media players and upload it to the user's Last.fm account. A listener, whoever, needs to self-select into creating a Last.fm account and connecting their streaming websites or computer's native media players to Last.fm. This data becomes publicly available to compute recommendations and compatibility analyses for the users. While it certainly leaves out an important chunk of the population (all of those who don't use Last.fm i.e. selection bias), it's better than nothing at all.
By using the [*lastfmR*](package) from Piotr Patrzyk[^2], which connects to the Last.fm API, I can pull data from all users who have scrobbled Epica.
[^2]: Run `remotes::install_github("ppatrzyk/lastfmR")` in the console and you will access the package easily enough.
```{r}
#| label: last-fm
#| echo: true
#| results: hide
# Get all tracks scrobbled and clean up the track names as they are *very* dirty
# This will remove all of the "part" words from the track names, which are pretty common
remove_parts<- c('pt 1', 'pt 2', 'pt 3', 'pt 4', 'pt 5', 'pt 6', 'pt 7', 'pt 8', 'pt 9', 'pt 10', 'pt 11', 'pt 12', 'pt 13', 'pt 14', 'pt 15',
'pt1', 'pt2', 'pt3', 'pt4', 'pt5', 'pt6', 'pt7', 'pt8', 'pt9', 'pt10', 'pt11', 'pt12', 'pt13', 'pt14', 'pt15',
'part 1', 'part 2', 'part 3', 'part 4', 'part 5', 'part 6', 'part 7', 'part 8', 'part 9', 'part 10', 'part 11', 'part 12', 'part 13', 'part 14', 'part 15',
'part1', 'part2', 'part3', 'part4', 'part5', 'part6', 'part7', 'part8', 'part9', 'part10', 'part11', 'part12', 'part13', 'part14', 'part15', 'part',
'part i', 'part ii', 'part iii', 'part iv', 'part v', 'part vi', 'part vii', 'part viii', 'part ix', 'part x', 'part xi', 'part xii', 'part xiii', 'part xiv', 'part xv',
'01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0')
epica_tracks_lastfm <-
lastfmR::get_tracks('Epica') %>%
mutate(track_clean = clean_strings(track, remove_char = remove_parts))
# Group by track name and sum scrobbles to later join with the spotify data
track_scrobbles <-
epica_tracks_lastfm %>%
group_by(track_clean) %>%
summarise(scrobbles = sum(scrobbles)) %>%
ungroup() %>%
mutate(track_clean = clean_strings(track_clean, remove_char = remove_parts),
live = if_else(track_clean %>% str_like('%live%'), 'live', 'not live')) %>%
filter(live == 'not live',
track_clean != ' ',
!is.na(track_clean),
!(track_clean %in% c('i', 'ii', 'iii', 'iv', 'v', 'vi', 'viii', 'viii', 'ix', 'x')))
# Inner join the songs dataset from lastfm to spotify track metrics based on "clean" track names
epica_tracks <-
epica_track_metrics %>%
transmute(track_name,
album_name,
album_id,
track_clean = clean_strings(track_name, remove_char = remove_parts)) %>%
inner_join(track_scrobbles, by = 'track_clean')
epica_tracks_joined <-
epica_track_metrics %>%
transmute(track_name,
album_name,
album_id,
track_clean = clean_strings(track_name, remove_char = remove_parts)) %>%
stringdist_inner_join(track_scrobbles,
by = 'track_clean',
method = 'dl',
max_dist = 2,
distance_col = 'dist')
epica_tracks %>% summarise(sum(scrobbles))
epica_tracks_joined %>% summarise(sum(scrobbles))
epica_tracks_joined %>% select(track_clean.x, track_clean.y, dist) %>% arrange(desc(dist))
# Create scrobbles by album and join with the album metrics and with album table to get the year
scrobbles_by_album <-
epica_tracks %>%
left_join(epica_releases_year, by = c('album_name' = 'release_spotify')) %>%
group_by(release_name_year, release_year) %>%
summarise(scrobbles = sum(scrobbles)) %>%
ungroup() %>%
arrange(desc(release_year))
```
```{r}
#| label: scrobbles-by-album
#| echo: false
#| fig-width: 12
#| fig-height: 8
# Include the epica logo in the dataframe
scrobbles_by_album %>%
filter(release_year < 2021) %>%
mutate(image2 = 'snake-logo1.png',
scrobbles1 = scrobbles/1000000) %>%
ggplot(aes(scrobbles1, reorder(release_name_year, scrobbles1))) +
geom_segment(aes(yend = release_name_year),
xend = 0,
colour = "gray",
linetype = "dashed") +
geom_image(aes(image = image2), size = 0.08) +
labs(x = "Millions of Last.fm scrobbles",
y = "",
caption = str_wrap("Note: Scrobbles are records of Last.fm users who chose to report the songs they listen to the platform. Data obtained through the Last.fm and Spotify APIs through the lastfmR and spotifyR packages. A mapping of the Last.fm dataset was done to the Spotify album-track datasets using Damerau-Levenshtein string distance, so data may not show all true scrobbles. Snake logo created and owned by EPICA.", 120),
title = "Epica's most popular albums were released in the 2000s",
subtitle = "The Classical Conspiracy (2007) remains a fan favourite with over 5M scrobbles")+
theme_daniel +
theme(axis.text.y = element_text(colour = 'gray'),
axis.text.x = element_text(colour = 'gray'),
axis.line.x = element_line(colour = 'white'),
axis.ticks.x = element_line(colour = 'white'))
```
The streams were obtained using the *lastfmR* package, which allows to obtain the number of streams of all songs which are mapped to most artists. However, two issues exist with the use of *lastfmR* for performing stream analysis. First, tracks are scrobbled from users *with the exact same name that they have in the user's computer or stream platform*. This is not an issue for Last.fm users who are streaming from platforms like Spotify, as track names are standard. However, for users scrobbling from their own media players (e.g. iTunes) you may get messy track names, making it difficult to map tracks to albums. I used Damerau-Levenshtein string distance to perfom fuzzy joins, however, the possibility of 100+ entries for the same track in the Last.fm makes it possible to only observe a portion of the scrobbles, as the API limits scrobble data to 10,000 rows.
The other issue is impossible to alleviate: we only observe a minority of stream users as probably only a small portion of all Epica listeners choose to record their listening patterns with Last.fm. All results from the analysis of this dataset could be significantly biased (which is why it is probably not worth running any kind of statistical model using this data). Access to the full dataset is likely only available to platform insiders and artists themselves, so at least any kind of advanced regression analysis is theoretically possible and useful for these parties.
Thankfully, the figure above shows coherent results. The most listened albums are those released in the 2000s, as they've had more time to be scrobbled by users. This makes sense for non-mainstream artists like Epica, however, it is also possible that listeners appreciate the more symphonic albums, among them *The Phantom Agony*. Because of the very small difference in scrobbles between *Consign to Oblivion* and *The Divine Conspiracy*, it isn't possible to say which of the two is the most popular. However, with the data at hand, it is safe to say that Epica's most popular releases were released in the previous decade, when modern metal was considerably more popular relative to the 2010s. Epica's latest effort, *The Alchemy Project*, released in 2022, seems to be the weakest, however, given the short amount of time that has passed, one could expect streams to gradually increase to the same levels of the late 2010s releases.
# Using the Genius API for lyric text analysis
Apart from Epica's characteristic blend of symphony and traditional heavy metal, another element which makes them stand out are lyrics. Unlike much of today's lyrics, Epica's are not easily accessible: all deal with complex ideas which motivate reflection by the listener on prevalent world themes. Moreover, the lyrics are strongly multiligual, with a strong emphasis on the use of Latin verses.
Apart from their artistic value, which music experters are better able to understand than me, rich lyrics allow for extensive and awesome *text analysis*. With the existence of the amazing [*geniusr*](https://ewenme.github.io/geniusr/) package, text analysis of musical lyrics is almost effortless[^3], as it allows to seamlessly download lyrics for songs of any artist on their database. The package is well documented and has intuitive functions; for text analysis, I used the *tidytext* package, which helps clean the text into tokens (words) and then eliminates common words (the, a) to facilitate analysis.
[^3]: What took most effort was making the *geniusr* package work properly. At the time of this post's writing, it was necessary to install a fork of the package with the code `remotes::install_github('giovanni-cutri/geniusR')`. You must have the *remotes* package installed.
```{r}
#| label: text-analysis
#| cache: true
#|
# Get the genius artist id for Epica using their artist search function
epica_id <-
search_artist('epica') %>%
select(artist_id) %>%
as.vector()
# Look for a track in the album to get the Genius album id
cry_for_the_moon_id <-
search_song('cry for the moon') %>%
slice(1) %>%
select(song_id) %>%
as.vector()
# Get data for cry for the moon
cry_for_the_moon_df <-
get_song_df(cry_for_the_moon_id)
# Get the album id for The Phantom Agony in the data of Cry for the Moon
the_phantom_agony_id <-
cry_for_the_moon_df %>%
select(album_id) %>%
as.vector()
# Get all songs in the album and their ids
the_phantom_agony_songs <-
get_album_tracklist_id(the_phantom_agony_id) %>%
inner_join(get_artist_songs_df(epica_id) %>%
select(song_name, song_id),
by = c('song_title' = 'song_name'))
track_ids <-
the_phantom_agony_songs$song_id
# Extract lyrics for all tracks and combine into a single dataframe
phantom_agony_lyrics <-
lapply(track_ids, get_lyrics_id) %>%
bind_rows() %>%
slice(-1:-2)
# Tokenize the lyrics and antijoin with stop words
phantom_agony_words <-
phantom_agony_lyrics %>%
unnest_tokens(output = word,
input = line) %>%
anti_join(stop_words, by = 'word')
# A word count df
phantom_agony_wordcount <-
phantom_agony_words %>%
count(word) %>%
rename(freq = 'n') %>%
arrange(desc(freq))
# Define colours
bright_orange <- "#FFA500" # hex code for bright orange
dark_red <- "#8B0000" # hex code for dark red
golden_yellow <- "#FFD700" # hex code for golden yellow
# Create color palette
color_palette <- colorRampPalette(c(bright_orange, dark_red, golden_yellow))
# Generate a palette of 100 colors
colors <- color_palette(100)
set.seed(123)
wc_phantom_agony <-
wordcloud2(data = phantom_agony_wordcount %>% head(100),
fontFamily = 'Garamond',
shape = 'diamond',
backgroundColor = 'grey30',
color = colors)
htmlwidgets::saveWidget(wc_phantom_agony,
"wordcloud_phantom_agony.html",
selfcontained = TRUE)
```
```{r}
#| label: word-count-graph
#| fig-width: 12
#| fig-height: 8
phantom_agony_wordcount %>%
top_n(10) %>%
ggplot(aes(reorder(word, freq), freq)) +
geom_col(fill = '#742014') +
coord_flip() +
geom_text(aes(label = freq), hjust = -0.5, colour = 'gray', size = 15) +
labs(title = "Most common words in Epica's The Phantom Agony",
x = '',
y = '',
caption = str_wrap("Note: Lyrics data obtained from the Genius API through the geniusr package. Tokenization of the text performed with the tidytext package.", 116),
subtitle = str_wrap('The album combines the themes of introspection with political and social statements', 100)) +
theme_daniel +
theme(axis.text.y = element_text(colour = 'gray'),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank())
```
Epica's lyrics usually revovle around intangible topics which motivate self-reflection, which is something strongly evident with words such as *life*, *die*, *agony*, *forever*. However, in their first album, the band strongly tackled social issues, namely, the misuse of religion for the personal benefit of leaders. Epica commonly uses series of songs across different albums which approach common topics, and in *The Phantom Agony*, the first series for the band started: *The Embrace That Smothers*[^4] (i.e. religion being the embrace?). The world events of the time (911) likely motivated Jansen's literal rather than figurative approach to the topic.[^5].
[^4]: Technically, this series started before Epica existed. Mark Jansen's previous band, After Forever, started the series and Epica finished it.]
[^5]: For instance, *Seif al Din* contains a lengthy and clear statement on how *most leaders interpret the old words to their advantage* (Epica, 2003 on [Genius](https://genius.com/Epica-seif-al-din-the-embrace-that-smothers-part-vi-lyrics), pp. 3)[@epica].
*The Embrace that Smothers* continued on for nine songs across three albums. However, this series, in my view, is best exemplified by *The Phantom Agony*, which contains the bulk of the series. With time, even though the religion theme remains highly relevant, Epica started approaching more abstract themes in a less literal way. The lyrics leaned on works by social and natural scientists, ancient cultures, among others. The seamless interface of the *geniusr* package allowed me to easily compare the first album to the rest of the discography, as seen below:
```{r}
#| label: all-discography-graph-analysis
#| cache: true
# Make it phantom agony vs. all of the other albums
# Get all album id's except TPA, first creating a vector of search terms to search for all albums
epica_album_search_terms <-
epica_album_metrics$release_name %>% unique()
epica_album_search_terms <-
epica_album_search_terms[epica_album_search_terms != 'The Phantom Agony']
# Get the album tracks with their genius id's
almost_all_epica_tracks <-
lapply(epica_album_search_terms, get_album_tracklist_search, artist = 'Epica') %>%
bind_rows() %>%
inner_join(get_artist_songs_df(epica_id) %>% select(song_id, song_name),
by = c('song_title' = 'song_name'))
# Get all song dataframes to get albums
song_dfs <-
lapply(almost_all_epica_tracks$song_id, get_song_df) %>%
bind_rows() %>%
select(song_id, song_name, album_name, album_id) %>%
mutate(album_name = str_trim(album_name) %>% noquote())
# Get the lyrics
almost_all_epica_lyrics <-
lapply(song_dfs$song_id, get_lyrics_id) %>%
bind_rows()
# Tokenize and antijoin with the stop words dictionary
almost_all_epica_words <-
almost_all_epica_lyrics %>%
unnest_tokens(input = line,
output = word) %>%
inner_join(song_dfs %>% select(song_id, album_name),
by = 'song_id') %>%
anti_join(stop_words, by = 'word')
# Word count
almost_all_epica_word_count <-
almost_all_epica_words %>%
count(word) %>%
arrange(desc(n))
# Make it phantom agony vs. all of the other albums
df_comparison <-
almost_all_epica_word_count %>%
rename(count_disc = 'n') %>%
full_join(phantom_agony_wordcount %>% rename(count_tpa = 'freq'), by = 'word') %>%
mutate(count_disc = replace_na(count_disc, 0),
count_tpa = replace_na(count_tpa, 0),
count_total = count_disc + count_tpa) %>%
arrange(desc(count_total)) %>%
top_n(15) %>%
mutate(rank = row_number()) %>%
gather(type,
count,
count_tpa,
count_disc) %>%
mutate(type = case_when(
type == 'count_tpa' ~ 'The Phantom Agony',
type == 'count_disc' ~ 'Rest of discography'
))
```
```{r}
#| label: comparison-graph
#| fig-width: 12
#| fig-height: 8
df_comparison %>%
ggplot(aes(reorder(word, rank, decreasing = T), count, fill = type)) +
geom_col(position = 'stack') +
coord_flip(ylim = c(0,200)) +
geom_text(data = df_comparison %>% filter(type == 'Rest of discography'),
aes(label = count_total),
size = 12,
color = 'gray',
hjust = -2.2) +
labs(title = "Most common words in Epica's Discography",
y = '',
x = '',
fill = '',
caption = str_wrap("Note: Includes word counts of The Phantom Agony and the rest of the Epica discography. Lyrics data obtained from the Genius API through the geniusr package. Tokenization of the text performed with the tidytext package.", 116)) +
guides(fill = guide_legend(ncol = 2)) +
scale_fill_manual(values = c("The Phantom Agony" = "#742014", "Rest of discography" = "#C3714A")) +
theme_daniel +
theme(axis.text.y = element_text(colour = 'gray'),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = 'top',
legend.justification = c(-0.25, 0))
ggsave('textanalysis1.png',
width = 8,
height = 6)
```
As expected given the retrospective nature of the band's work, *life* remains as the top word found in lyrics across all original releases, yet other words, such as *truth*, *matter*, *conquer* which are more literal, disappear from the band's top 15 word count. It is clear how the theme of self-reflection is present in all releases.
## Sentiment analysis
A painful truth about metal (at least in Latin America) is that it often faces resistance from people who relate the genre to negativeness. When Marilyn Manson visited Mexico in 2003, for instance, [Christian organizations held prayer sessions to "counter the satanist effects of the artist"](https://www.eluniverso.com/2003/11/05/0001/259/A4B577F8FED84DB29FD4109E384B70C3.html/)^[Seemingly, this also happened in Hungary and Italy at around the same time.]. Though Epica has not faced these types of accusations, for many people it would still fall under the broad spectrum of metal, which is supposed to be charged with negative energy. How true is this?
It is possible to attempt an answer to this question through *sentiment analysis*. In a nutshell, one uses dictionaries or lexicons which contain thousands of words along with a connotation (could be negative, energetic, etc.) to compare with the text of choice. Below, I use the Bing lexicon to categorize *The Phantom Agony's* lyrics into either positive or negative.
```{r}
#| label: sentiment-bing
#| fig-width: 12
#| fig-height: 8
bing_lexicon <-
get_sentiments('bing')
bing_wordcount <-
phantom_agony_wordcount %>%
inner_join(bing_lexicon, by = 'word') %>%
group_by(sentiment) %>%
arrange(desc(freq)) %>%
slice_head(n = 10)
bing_wordcount %>%
ggplot(aes(reorder(word, freq), freq, fill = sentiment)) +
facet_wrap(~ sentiment, scales = 'free_y') +
geom_text(aes(label = freq),
size = 12,
color = 'gray',
hjust = -1) +
geom_col() +
coord_flip() +
labs(y = '',
x = '',
title = "The Phantom Agony's Sentiment Analysis",
subtitle = 'Using the Bing lexicon to classify words into a positive/negative dichotomy\n',
caption = str_wrap("Note: Includes word counts of The Phantom Agony. Lyrics data obtained from the Genius API through the geniusr package. Tokenization of the text performed with the tidytext package. Sentiment analysis used the Bing lexicon from the tidytext package.", 116)) +
guides(fill = 'none') +
geom_text(aes(label = freq), colour = 'gray', hjust = 0.5) +
theme_daniel +
theme(axis.text.x = element_blank(),
axis.text = element_text(colour = 'grey', hjust = 1))
ggsave('textanalysis2.png',
width = 8,
height = 6)
```
This initial analysis shows that the top words in the album are indeed categorised as negative, and in fact over 70% of the album's words are so negative. It is important to understand that this type of sentiment analysis does not perform well with complicated semantic constructions, such as sarcasm or metaphors. For instance, if a song were to constantly mention *suffering*, but the song is about getting over *suffering* to reach happiness, the bing lexicon would still consider the song as negative. The stringent dichotomy of positive-negatively is structurally flawed; a word's connotation is not independent from its context. To partially solve this, there are other lexicons, the *Loughran* being one that has more than just positive or negative dimensions. I apply this lexicon to the band's lyrics below:
```{r}
#| label: sentiment-over-time
#| include: false
#| cache: true
epica_all_ids <-
lapply(epica_album_metrics$release_name %>% unique(),
get_album_tracklist_search,
artist = 'Epica') %>%
bind_rows() %>%
inner_join(get_artist_songs_df(epica_id) %>% select(song_id, song_name),
by = c('song_title' = 'song_name')) %>%
select(song_id)
epica_all_song_dfs <-
lapply(epica_all_ids$song_id, get_song_df) %>%
bind_rows() %>%
select(song_id, song_name, album_name, album_id) %>%
mutate(album_name = str_trim(album_name) %>% noquote())
epica_all_words <-
epica_all_song_dfs$song_id %>%
lapply(get_lyrics_id) %>%
bind_rows()
epica_tokenised <-
epica_all_words %>%
unnest_tokens(input = line,
output = word) %>%
inner_join(epica_all_song_dfs %>% select(song_id, album_name),
by = 'song_id') %>%
anti_join(stop_words, by = 'word') %>%
inner_join(bing_lexicon, by = 'word')
epica_sentiment_releases <-
epica_tokenised %>%
group_by(album_name, sentiment) %>%
summarise(freq = n()) %>%
mutate(perc = freq/sum(freq),
album_name = as.character(album_name)) %>%
inner_join(epica_releases_year,
by = c('album_name' = 'release_name'))
# Graph
epica_sentiment_releases %>%
ggplot(aes(reorder(album_name, release_year), freq, fill = sentiment)) +
geom_col(position = 'stack')
```
```{r}
#| label: sentiment-nrc
#| fig-width: 12
#| fig-height: 8
nrc <- get_sentiments('nrc')
epica_tokenised <-
epica_all_words %>%
unnest_tokens(input = line,
output = word) %>%
inner_join(epica_all_song_dfs %>% select(song_id, album_name),
by = 'song_id') %>%
anti_join(stop_words, by = 'word') %>%
inner_join(nrc, by = 'word')
epica_nrc_tpa <-
epica_tokenised %>%
filter(album_name == 'The Phantom Agony') %>%
group_by(sentiment) %>%
summarise(freq = n()) %>%
mutate(type = 'The Phantom Agony')
epica_nrc_all_but_tpa <-
epica_tokenised %>%
filter(album_name != 'The Phantom Agony') %>%
group_by(sentiment) %>%
summarise(freq = n()) %>%
mutate(type = 'Rest of discography')
epica_nrc_counts <-
epica_nrc_tpa %>%
bind_rows(epica_nrc_all_but_tpa)
epica_nrc_counts %>%
ggplot(aes(reorder(sentiment, freq), freq, fill = type)) +
geom_col(position = 'stack')+
coord_flip() +
labs(x = '',
y = '',
title = "Sentiment Analysis of all Epica's discography",
subtitle = "Using the NRC lexicon to augment Epica's sentiment analysis",
caption = str_wrap("Note: Includes word counts of The Phantom Agony. Lyrics data obtained from the Genius API through the geniusr package. Tokenization of the text performed with the tidytext package. Sentiment analysis used the NRC lexicon from the tidytext package.", 116)) +
guides(fill = 'none') +
theme_daniel +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = 'top',
legend.justification = c(-0.25, 0))
```
Aha! A lexicon which is able to categorise sentiments in a richer manner finds that while Epica is predominantly "negative", "positive" words are the second most common. Note that the NRC lexicon could categorise a word in two different sentiments, so this graph is counting *matches* of the words with the lexicon rather than unique word counts themselves: the words in sadness, anger and disgust are already contained in the negative sentiment, for instance.
While the argument against nosy relatives judging the young metalhead is far from over, the data here shows two things. The richer the lexicon, the richer the conclusions that can be obtained. Having a binary approach is not useful. Second, and probably most important: lexicons kind of suck with music, as they expect literal meanings only. The few conclusions that could be extracted from sentiment analysis have to be carefully vetted through a thorough, qualitative-ish analysis.
# Some final remarks
Though I also feel that this painful, pseudo-scientific attempt at understanding Epica's lyrics has run its course, a few things are worth mentioning.
While not much is clear, one thing is: as times have changed, Epica has changed too, but it has still remained true to its beginnings. *The Phantom Agony* set the scene for what what came afterwards in terms of lyric content and musicality, and that is why it remains as one of the most streamed releases for the band. Epica remains strongly energic, but has reduced its degree of instrumentality to favour more modern types of metal in their latest releases.
It is difficult to quantify lyrical content, but through tidy text analysis I was able to see how Epica has somehow shifted towards less literal content as time has passed. With *The Phantom Agony*, words with more literal, earthly meanings were common than other albums. Sentiment analysis yields that Epica uses mostly negative words, but it was discussed how this may not necessarily mean that Epica itself is a "negative" band.
Why care about this? Well, all music genres change, and metal is no different. Understanding how change takes place makes us better listeners of music and less susceptible to the fear-mongering that every now and then takes place (i.e. is metal dead?). Further, while empirical analysis of musical data might not be of interest to those who are not fans, the exercise itself is valid in the context of reproducibility. Creating a scientific environment for the analysis of music data in an open source language like R helps researchers and industry professionals assess questions of broader impact while worrying less about developing the environment from scratch.