# Music Sentiment Analysis
by Ben Pfeffer

## Intro 
What emotions are expressed most commonly in the lyrics of popular music in the US? What emotions are less popular? Are different emotions correlated with one another? In this project I analyze the emotional content of the lyrics of popular music from the last 50 years. 

Keep in mind that I am only analyzing the emotional content of the *lyrics*, not the *melodies*, in this project. Analyzing music while totally disregarding the melody is at best incomplete, so take everything you read here with a grain of salt. Sometimes the missing melody may be [amplifying](https://www.youtube.com/watch?v=4fWyzwo1xg0) the lyrical emotions explored here. Other times the tune [contrasts](https://www.youtube.com/watch?v=unfzfe8f9NI) with the lyrics, acting as an emotional counterpoint or overriding the lyrics altogether. There is a lot of research on the interplay between [music and lyrics](https://www.researchgate.net/publication/247733383_Songs_and_emotions_Are_lyrics_and_melodies_equal_partners) that warrants further investigation. I think this project can provide valuable insight into which emotions Americans explicitly indulge in through song lyrics, but even in the best-case scenario this is only half the picture. 

This project was inspired by two data science projects I read about on the internet. The first is Greg Rafferty's [Sentiment Analysis on the Texts of Harry Potter](https://towardsdatascience.com/basic-nlp-on-the-texts-of-harry-potter-sentiment-analysis-1b474b13651d). The second is John W. Miller's [Trucks and Beer 🍺](https://www.johnwmillr.com/trucks-and-beer/). I decided to combine the two, performing a sentiment analysis similar to Rafferty's, but using Miller's Genius API wrapper to scrape lyrics from many musical artists. 

The Billboard Hot 100 Artists is a year-end list of the most successful and prominent artists of the year. It measures record sales, radio airtime, and (for newer artists) internet streaming in the US. The lists for every year from 1970 to 2019 are available online. For these reasons I believe the lists let us build a good sample of the most popular music in the US. One advantage of this approach is that it gives us historical data on artists from before streaming was popular (Spotify can't tell us who people were listening to in the '70s). One limitation is that it only gives us data about the very top of the pyramid. A great deal of music that is very popular in the US is made by artists who have never made a Billboard Hot 100 Artists list. There is a risk that the emotional content of those artists' lyrics differs significantly from that of the artists at the top. 


 ## Data Wrangling 
To begin, I scraped the Billboard Hot 100 Artists list for each year from 1970 to 2019 using the BeautifulSoup package. Then I removed all the duplicates (for example, Michael Jackson was on the list for much of the '80s). I was left with a list of 1540 musical artists who were popular in the US for at least one year between 1970 and 2019. Some are staples of popular music, while others are closer to one-hit wonders. The code for this section is [here](https://github.com/ben-pfeffer/lyrics-sentiment/blob/master/scrape-artists.py). 
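
As a rough illustration of the de-duplication step, here is a minimal sketch. The yearly lists below are made-up stand-ins for the scraped pages, and `unique_artists` is a hypothetical helper, not a function from the linked script.

```python
# Hypothetical year-end lists; the real ones come from the scraped Billboard pages.
yearly_lists = {
    1983: ["Michael Jackson", "Taco", "The Police"],
    1984: ["Michael Jackson", "Prince", "The Police"],
}

def unique_artists(lists_by_year):
    """Flatten the yearly lists and drop repeats, keeping first-seen order."""
    seen = {}
    for year in sorted(lists_by_year):
        for name in lists_by_year[year]:
            # Remember only the first year in which an artist appeared.
            seen.setdefault(name.strip(), year)
    return list(seen)

artists = unique_artists(yearly_lists)
print(artists)
```

Using a dictionary rather than a set keeps the artists in the order they first appeared, which makes the resulting list easier to spot-check against the original pages.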
 
Next, I wanted to gather the lyrics for the most popular songs by each artist. I used John W. Miller's Genius API wrapper to search for songs by each of the 1540 artists. I limited my search to the 25 most popular songs by each artist, as determined by the Genius algorithm. Many artists have fewer than 25 songs available, but for expediency's sake I did not search for every song by artists with more than 25 songs. The lyrics are cleaned by removing punctuation, removing stop words (words like "a," "the," "of," "and," etc.), and converting everything to lower case. The code for this is [here](https://github.com/ben-pfeffer/lyrics-sentiment/blob/master/gather-lyrics.py).  
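
The cleaning step can be sketched roughly as follows. This is a simplified illustration using only the standard library: the stop-word set is a tiny hypothetical sample, and `clean_lyrics` is not the function used in the linked script.

```python
import string

# A tiny illustrative stop-word set; the project presumably used a much longer list.
STOP_WORDS = {"a", "the", "of", "and", "is", "it", "for", "what"}

def clean_lyrics(raw):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = raw.lower()
    # Delete every ASCII punctuation character in one pass.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

cleaned = clean_lyrics("War, huh, yeah! What is it good for? Absolutely nothing.")
print(cleaned)
```

Note that deleting punctuation this way also removes apostrophes, so a contraction such as "don't" becomes "dont".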
 
It was convenient to write the output of the previous step to a JSON file, but I wanted to work with a data frame, so I converted the JSON file to a CSV. The code for that piece is [here](https://github.com/ben-pfeffer/lyrics-sentiment/blob/master/json-to-csv.py).
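
A conversion along these lines might look like the sketch below. The JSON layout (artist name mapped to one cleaned lyrics string) is an assumption made for illustration; the actual file written by the gathering script may be structured differently.

```python
import csv
import io
import json

# Assumed JSON layout: artist name -> cleaned lyrics string.
raw_json = json.dumps({
    "Edwin Starr": "war huh yeah good absolutely nothing",
    "Dean Lewis": "look ground see sad teary eyes",
})

def json_to_csv(json_text):
    """Convert an {artist: lyrics} JSON object into CSV text with a header row."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Artist", "Lyrics"])
    for artist, lyrics in records.items():
        writer.writerow([artist, lyrics])
    return buffer.getvalue()

csv_text = json_to_csv(raw_json)
print(csv_text)
```

Writing into an in-memory buffer keeps the example self-contained; swapping the buffer for an open file handle would produce a CSV on disk.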
 

## Methodology 
Now that I have the artists and lyrics in a nicely formatted data frame, it is time to do the sentiment analysis. I used the [NRC Emotion Lexicon](https://nrc.canada.ca/en/research-development/products-services/technical-advisory-services/sentiment-emotion-lexicons) to ascribe emotions to musical lyrics. The NRC Lexicon has word-emotion associations for eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as a general positive or negative sense. For example, the word "vindictive" is associated with "anger," "disgust," and "negative." The word "fondness" is associated with "joy" and "positive." Each word is scored either a 0 (not associated) or a 1 (associated) for each emotion. The lexicon has over 25,000 words classified in this way.
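
To make the 0/1 scoring concrete, here is a small sketch that turns lexicon rows into a word-to-emotions lookup. It assumes a tab-separated `word, emotion, flag` layout for the lexicon file, which should be checked against the copy you download. The sample rows only encode the two examples given above, and `parse_lexicon` is a hypothetical helper.

```python
from collections import defaultdict

# Sample rows in an assumed "word<TAB>emotion<TAB>flag" layout.
# The associations are the two examples quoted in the text.
SAMPLE_ROWS = """\
vindictive\tanger\t1
vindictive\tdisgust\t1
vindictive\tjoy\t0
vindictive\tnegative\t1
fondness\tanger\t0
fondness\tjoy\t1
fondness\tpositive\t1
"""

def parse_lexicon(text):
    """Map each word to the set of emotions flagged 1 for it."""
    lexicon = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, emotion, flag = line.split("\t")
        if flag == "1":
            lexicon[word].add(emotion)
    return dict(lexicon)

lexicon = parse_lexicon(SAMPLE_ROWS)
print(sorted(lexicon["vindictive"]))
print(sorted(lexicon["fondness"]))
```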
 
For each artist in the data frame, I kept a scorecard with one tally per emotion. I looked at each word in the artist's lyrics and cross-referenced it against the NRC lexicon for emotional associations. Any time there was a match, I added a point to the appropriate emotion(s) for that artist. Since different artists have very different word counts, I divided the artist's score for each emotion by their total word count. This gives the share of the artist's lyrics (after stop words are removed) associated with each emotion in the NRC lexicon. I also calculated the number of unique words and the lyrical diversity (unique words / total words) for each artist. 
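
The scoring described above can be sketched as follows. This is a minimal illustration, not the linked script: `TOY_LEXICON` contains only the two example words from the lexicon description, and `score_artist` is a hypothetical name.

```python
from collections import Counter

EMOTIONS = ["positive", "negative", "anger", "anticipation", "disgust",
            "fear", "joy", "sadness", "surprise", "trust"]

# Toy lexicon holding only the two example words from the lexicon description.
TOY_LEXICON = {
    "vindictive": {"anger", "disgust", "negative"},
    "fondness": {"joy", "positive"},
}

def score_artist(words, lexicon):
    """Return per-emotion shares plus word-count statistics for one artist."""
    counts = Counter()
    for word in words:
        # One point per emotion associated with this word (none if unknown).
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    total = len(words)
    scores = {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}
    unique = len(set(words))
    scores["Word Count"] = total
    scores["Unique Words"] = unique
    scores["Lyrical Diversity"] = unique / total if total else 0.0
    return scores

words = ["fondness", "vindictive", "fondness", "train"]
result = score_artist(words, TOY_LEXICON)
print(result["joy"], result["anger"], result["Lyrical Diversity"])
```

In this toy input, "fondness" appears twice in four words, so the joy share is 0.5. Three of the four words are distinct, so the lyrical diversity is 0.75.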
 
The code for this section is [here](https://github.com/ben-pfeffer/lyrics-sentiment/blob/master/sentiment-analysis.py). We started with 1540 artists and searched for lyrics by each of them. Some searches returned nothing, and we dropped those artists. We are left with 1463 artists in our final data frame. Below are the first five rows of our final dataset. 
 

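```python
import pandas as pd

# Load the cleaned sentiment data, using the artist name as the row index
df = pd.read_csv('sentiment-data-clean.csv', index_col='Artist')
# Drop the leftover 'index' column
df = df.drop('index', axis=1)
df.head().round(3)
```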

| Artist | Lyrics | Word Count | Unique Words | Lyrical Diversity | positive | negative | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O.T. Genasis | love coco coco love coco coco got low low love... | 6304 | 1479 | 0.235 | 0.068 | 0.093 | 0.084 | 0.048 | 0.039 | 0.047 | 0.051 | 0.047 | 0.043 | 0.053 |
| Dean Lewis | look ground see sad teary eyes look away see s... | 3331 | 564 | 0.169 | 0.063 | 0.091 | 0.034 | 0.048 | 0.014 | 0.045 | 0.054 | 0.053 | 0.03 | 0.022 |
| Edwin Starr | war huh yeah good absolutely nothing uhuh war ... | 3359 | 749 | 0.223 | 0.108 | 0.057 | 0.017 | 0.056 | 0.013 | 0.046 | 0.08 | 0.032 | 0.031 | 0.062 |
| Cali Swag District | like smoove teach dougie know cause bitches lo... | 7635 | 1676 | 0.22 | 0.08 | 0.101 | 0.085 | 0.035 | 0.06 | 0.063 | 0.052 | 0.053 | 0.04 | 0.055 |
| The Marshall Tucker Band | gonna take freight train station lord care goe... | 2426 | 678 | 0.279 | 0.1 | 0.071 | 0.023 | 0.071 | 0.035 | 0.03 | 0.061 | 0.036 | 0.023 | 0.048 |


## Findings 
- more happy than sad, which emotions correlated, distributions for each emotion

The code for the images in this section can be found [here](https://github.com/ben-pfeffer/lyrics-sentiment/blob/master/sentiment-vizs.ipynb).

 One of the first visualizations I made gave me two informative outliers. The graph below plots total Word Count against Unique Words. As you would expect, artists with higher word counts tend to have more unique words. I have labeled the two outliers. 

![Outliers](pics/word-count.png)

Taco Hemingway has more unique words than any other artist, and many more than you would expect given his word count. The reason is that Taco Hemingway is a Polish rapper. Polish text presumably produces more distinct word forms than English, perhaps because of its heavier inflection. I used a stemmer to reduce words to their base forms (for example, running -> run and curtains -> curtain), but the stemmer only targets English words and probably left the Polish words unchanged. The emotion lexicon likewise only recognizes English words, so we expect Taco Hemingway's lyrics to show up as less emotional than average. 

```python
df.sort_values('Unique Words', ascending=False).head(1)
```

| Artist | Lyrics | Word Count | Unique Words | Lyrical Diversity | positive | negative | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Taco Hemingway | teraz porozmawiajmy pieniądzach zauważyliście ... | 13903 | 4810 | 0.345968 | 0.009135 | 0.002733 | 0.003596 | 0.00374 | 0.000719 | 0.00151 | 0.00784 | 0.001151 | 0.007337 | 0.003668 |


Indeed, Hemingway's lyrics are counted as highly unemotional by the algorithm. 
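
One way to catch cases like this automatically is to flag artists whose lyrics barely register in the lexicon at all. The sketch below uses the positive and negative shares reported in this write-up. The cutoff `MIN_COVERAGE` and the helper `low_coverage` are hypothetical and were not part of the project.

```python
# Positive and negative shares copied from the rows shown in this write-up.
shares = {
    "O.T. Genasis": (0.068, 0.093),
    "Edwin Starr": (0.108, 0.057),
    "Taco Hemingway": (0.009135, 0.002733),
}

MIN_COVERAGE = 0.05  # hypothetical cutoff, not a value used in the project

def low_coverage(artist_shares, cutoff=MIN_COVERAGE):
    """Return artists whose positive + negative share falls below the cutoff."""
    return [name for name, (pos, neg) in artist_shares.items()
            if pos + neg < cutoff]

flagged = low_coverage(shares)
print(flagged)
```

A low combined share does not prove the lyrics are non-English, but it is a cheap signal for which artists deserve a manual look.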

Now, I had never heard of Taco Hemingway, and I was surprised that a Polish rapper made it onto an American hot music list. After some investigation, I found that Taco Hemingway has never been on the Hot 100 Artists list, but an artist called Taco had a hit in 1983 and made the list. Taco does not appear in my data. I think what happened is that when I queried the Genius API for Taco, it gave me the 25 most popular songs by any artist with Taco in their name, and Taco Hemingway happens to be more popular than Taco from the '80s. 

In the same way, I'm quite certain that Jonathan Edwards (Theologian), a theologian from the colonial era, did not make the Billboard Hot 100 in the last 50 years. There was a musical artist with the same name who had a hit in 1971. When the Genius API was queried, it gave me data on the theologian, presumably because the theologian has an audiobook that is more popular than the musician's work. 

It seems that the code accepts lyrics from artists whose names are close to, but not exact matches for, the artist I searched for. I would like to fix this in a future version of the project, but for now I do not think it skews the results enough to require re-coding before looking at them. We will proceed with the analysis.
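
A possible guard against near matches is to compare the requested name with the name attached to the returned result and keep only exact matches after light normalization. The sketch below is an assumption about how such a check could look: `normalize` and `is_exact_match` are hypothetical helpers, and how the returned artist name is read from the Genius wrapper is left out.

```python
import unicodedata

def normalize(name):
    """Case-fold, strip accents, and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents) left behind by the decomposition.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())

def is_exact_match(requested, returned):
    """True only when the returned artist name equals the requested one."""
    return normalize(requested) == normalize(returned)

print(is_exact_match("Taco", "Taco Hemingway"))
print(is_exact_match("Beyonce", "Beyoncé"))
print(is_exact_match("Jonathan Edwards", "Jonathan Edwards (Theologian)"))
```

With this check, both mismatches described above would be rejected, while spelling variants that differ only in case or accents would still be accepted.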

But first, let's take a small detour into the writings of Jonathan Edwards (Theologian). Since the "lyrics" data returned for Edwards is a chunk of his writing, he tallied up a far higher word count than any of the musicians. Since Edwards wrote in English we can take a look at how the algorithm ranked the emotionality of his work. 

Edwards is best known for his sermon "Sinners in the Hands of an Angry God."  The sermon is famous for its vivid imagery of hell and for scaring crowds of people into a sudden fervor. He is seen as representative of colonial Calvinist theology in his descriptive focus on heaven and hell as real, physical places. Edwards is credited with being a major figure in starting the First Great Awakening in US history. 

```python
df.sort_values('Word Count', ascending=False).head(1)
```

| Artist | Lyrics | Word Count | Unique Words | Lyrical Diversity | positive | negative | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jonathan Edwards (Theologian) | deuteronomy xxxii 35 foot shall slide due time... | 26517 | 4344 | 0.163819 | 0.156541 | 0.052645 | 0.023758 | 0.099144 | 0.021609 | 0.0712 | 0.087717 | 0.028397 | 0.030886 | 0.123921 |


The emotional content of his work scores highly on positivity, trust, anticipation, joy, and fear. Other negative emotions have low scores. Edwards's sermons seem to be written generally from a human perspective on heaven and hell, rather than from God's perspective. So it makes sense that emotions like anger, disgust, and sadness are deemphasized, as those would more likely be associated with God's feelings towards humans in Edwards's theology. 

However, for someone who is primarily associated with his vivid hell imagery, I was surprised at how much more positivity was in his work (heaven) than negativity (hell). Overall, the emotional analysis seems like it roughly matches the themes of his work, so I take this as good news that the algorithm is doing what it set out to do. We drop Edwards from the data since he is not a musician, but keep Taco Hemingway. 

## Conclusion 
- wrap up & suggestions for further investigation
    - scrape the top 100 songs per year and analyze emotional content over time
    - take one artist, emotionally classify each album, and compare (e.g., Beyoncé)
    - make sure I get the right artists and only use English lyrics (Taco Hemingway)