# Notebook 4.1: Data-Driven Insights - How Lyrics Affect the Popularity of a Song

To our astonishment, using **only lyrics** to predict a song's popularity achieves almost `80%` accuracy on the test set, which highlights how strongly lyrics influence a song's success. Intrigued by this finding, we examine the BERT model and its **attention over the lyrics** to better understand the relationship between lyrics and a song's appeal.


## BERT Visualization
* [A Popular Song](#pop)
* [A Less Popular Song](#lessp)
* [Conclusion](#conc4)


In [16]:
import numpy as np
import pandas as pd
import seaborn as sns
import torch
from bertviz import model_view
from transformers import AutoTokenizer, AutoModel, utils

utils.logging.set_verbosity_error()  # Suppress standard warnings

### Load the dataset


In [17]:
df = pd.read_csv("./positive_and_negative_one_hot.csv")
df = df.dropna()
df

Unnamed: 0,artist,year,views,features,lyrics,id,url,acousticness,danceability,duration_ms,...,key_8,key_9,key_10,key_11,tag_country,tag_misc,tag_pop,tag_rap,tag_rb,tag_rock
0,AKING,2015,4.432273e-05,{},Glorious mistakes are anxiously waiting to be ...,985583,https://open.spotify.com/track/30sr35axWFPOvmi...,0.760040,0.806517,0.144170,...,0,0,0,0,0,0,1,0,0,0
1,Filip Winther,2020,1.251733e-06,{},[Intro]\nDe-de-deluxe\n\n[Refräng]\nJag fuckar...,5097257,https://open.spotify.com/track/4mznGf6tTvHp74y...,0.020681,0.894094,0.141797,...,0,0,0,1,0,0,0,1,0,0
2,Dan Reeder,2018,1.513459e-05,{},The guy who bathes in the pond at the park\nTh...,3407076,https://open.spotify.com/track/1UbSSyqIVEkooKe...,0.993976,0.554990,0.044422,...,0,0,0,0,0,0,1,0,0,0
3,Noa Azazel,2021,1.251733e-06,{},[Pre-Chorus]\nWhen the moon is taking over i'm...,7061926,https://open.spotify.com/track/51F8whLH1Qou7iV...,0.214858,0.419552,0.169140,...,0,0,0,0,0,0,1,0,0,0
4,070 Phi,2019,2.031221e-05,{},[Chorus]\nAin't no way that you ain't eatin' w...,4241387,https://open.spotify.com/track/0mvzUwvyLT1Dm1y...,0.367469,0.695519,0.146753,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9529,mounika yadav,2021,1.257423e-05,"{""Allu Arjun"",""Rashmika Mandanna""}",నువ్ అమ్మీ అమ్మీ అంటాంటే నీ పెళ్ళాన్నైపోయినట్ట...,7552375,https://open.spotify.com/track/4ZUxhQNRCzlh6al...,0.360441,0.821792,0.161581,...,0,0,0,0,0,0,1,0,0,0
9530,d-metal stars,2016,1.706909e-07,{},[Verse 1]\nThe seaweed is always greener\nIn s...,7558599,https://open.spotify.com/track/0F8nLktPi0SgOAm...,0.000092,0.542770,0.154411,...,1,0,0,0,0,0,0,0,0,1
9531,grupo firme,2021,2.048290e-06,{Maluma},"Dejen de meterse ya, en donde no les importa\n...",7728445,https://open.spotify.com/track/5BE9B2FiFWBbBdo...,0.137549,0.719959,0.142190,...,0,0,0,0,0,0,1,0,0,0
9532,hensonn,2021,7.567295e-06,{},[Instrumental],7814578,https://open.spotify.com/track/6nqdgUTiWt4JbAB...,0.146585,0.626273,0.122640,...,0,0,0,0,0,0,0,1,0,0


### We use the `distilbert-base-uncased` model to visualize connections within the lyrics

In [18]:
# Hugging Face checkpoint used for all attention visualizations below
model_name = "distilbert-base-uncased"

## A Popular Song <a name = "pop"> </a>

We chose a popular song and visualized the **attention patterns** within an excerpt of its lyrics. The graph reveals numerous attention connections among the tokens, indicating **strong correlation and cohesion** within the lyrics.

**Engaging lyrics** are a key element in keeping listeners captivated by a song.

In [21]:
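pos_song = df.iloc[6400]
# Use a 100-character excerpt of the lyrics (characters 100-199)
input_text = pos_song['lyrics'][100:200]
# output_attentions=True makes the model return its attention weights
model = AutoModel.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # token IDs, shape (1, seq_len)
outputs = model(inputs)
attention = outputs[-1]  # tuple with one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # token strings used as labels
model_view(attention, tokens)  # interactive view of every layer and head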


## Static picture of the result, in case the interactive view above cannot be loaded

![Results1](./4.1.1Result.png "Results1")
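The "number of connections" in the graph is judged by eye. As a rough sketch of how it could be turned into a number, the function below counts the share of token pairs whose attention weight exceeds a cutoff. The function name `connection_density` and the cutoff of `0.1` are assumptions made for this illustration and are not part of the original analysis. The toy arrays stand in for real attention weights.

```python
import numpy as np


def connection_density(attention, threshold=0.1):
    """Fraction of token pairs with attention weight above `threshold`.

    `attention` is assumed to have shape (layers, heads, seq_len, seq_len),
    where every row is a softmax distribution that sums to 1.
    The threshold value is an arbitrary choice for illustration.
    """
    attention = np.asarray(attention, dtype=float)
    if attention.ndim != 4:
        raise ValueError("expected shape (layers, heads, seq_len, seq_len)")
    return float((attention > threshold).mean())


# Toy inputs: 1 layer, 1 head, 4 tokens.
uniform = np.full((1, 1, 4, 4), 0.25)     # each token attends equally to all tokens
focused = np.eye(4).reshape(1, 1, 4, 4)   # each token attends only to itself

print(connection_density(uniform))  # 1.0
print(connection_density(focused))  # 0.25
```

To apply this to the `attention` tuple from the cell above, the per-layer tensors would first have to be stacked into one array, for example with `np.stack([layer[0].detach().numpy() for layer in attention])`. Because attention weights shrink as the number of tokens grows, a fixed cutoff is only comparable between excerpts of similar token length.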


## A Less Popular Song <a name = "lessp"> </a>

In contrast, we selected a less popular song and visualized the attention patterns within its lyrics. The graph shows far **fewer connections** in comparison to the first song, suggesting that the lyrics are **less related and cohesive**. 

This contrast suggests that **well-crafted lyrics** have a significant impact on a song's popularity.

In [22]:
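neg_song = df.iloc[2000]
# Use the same 100-character excerpt range as for the popular song
input_text = neg_song['lyrics'][100:200]
# output_attentions=True makes the model return its attention weights
model = AutoModel.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # token IDs, shape (1, seq_len)
outputs = model(inputs)
attention = outputs[-1]  # tuple with one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # token strings used as labels
model_view(attention, tokens)  # interactive view of every layer and head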


## Static picture of the result, in case the interactive view above cannot be loaded

![Results2](./4.1.2Result.png "Results2")
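The two excerpts have the same number of characters but may tokenize to different lengths, which makes a purely visual comparison hard to judge. One possible length-aware summary is the entropy of each attention row, divided by its maximum value `log(seq_len)`. This is a sketch under assumptions: the function name `normalized_attention_entropy` is hypothetical, and entropy is a swapped-in measure rather than the one used for the figures above. Values near 1 mean attention is spread over many tokens; values near 0 mean it is concentrated on a few.

```python
import numpy as np


def normalized_attention_entropy(attention, eps=1e-12):
    """Mean row entropy of the attention weights, scaled to the range [0, 1].

    `attention` is assumed to have shape (layers, heads, seq_len, seq_len),
    where every row is a softmax distribution that sums to 1.
    """
    attention = np.asarray(attention, dtype=float)
    seq_len = attention.shape[-1]
    if attention.ndim != 4 or seq_len < 2:
        raise ValueError("expected shape (layers, heads, seq_len, seq_len) with seq_len >= 2")
    # Shannon entropy of each row; eps avoids log(0) for zero weights.
    row_entropy = -(attention * np.log(attention + eps)).sum(axis=-1)
    # Divide by the entropy of a uniform row so sequences of different length are comparable.
    return float(row_entropy.mean() / np.log(seq_len))


# Toy inputs: 1 layer, 1 head, 4 tokens.
spread_out = np.full((1, 1, 4, 4), 0.25)       # attention spread evenly over all tokens
concentrated = np.eye(4).reshape(1, 1, 4, 4)   # attention concentrated on a single token

print(normalized_attention_entropy(spread_out))    # close to 1
print(normalized_attention_entropy(concentrated))  # close to 0
```

Computing this value for both songs would give one number per song that can be compared directly, although two songs alone remain too small a sample to generalize from.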


## Conclusion <a name = "conc4"> </a>


The visualizations suggest that **popular songs** tend to exhibit **more attention within their lyrics**, while less popular songs show noticeably less. Because BERT’s attention mechanism captures how strongly the tokens of a text relate to one another, this finding suggests that **well-crafted lyrics** have a significant influence on a song’s popularity.

This observation is particularly important for **music producers**, who should not underestimate the importance of lyrics in their songs.
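A claim about popular songs in general would need more than two examples. The sketch below shows one way such a check could look: given one attention score per song and the matching `views` values, it splits the songs at the median view count and compares the group means, alongside a Pearson correlation. The function name `compare_by_popularity`, the median split, and the toy numbers are all assumptions for illustration; no such aggregate was computed in this notebook.

```python
import numpy as np


def compare_by_popularity(scores, views):
    """Compare a per-song attention score between popular and less popular songs.

    `scores` and `views` are assumed to be 1-D sequences of equal length,
    one entry per song. Songs above the median view count count as popular.
    """
    scores = np.asarray(scores, dtype=float)
    views = np.asarray(views, dtype=float)
    if scores.ndim != 1 or scores.shape != views.shape:
        raise ValueError("scores and views must be 1-D and of equal length")
    popular = views > np.median(views)
    return {
        "mean_popular": float(scores[popular].mean()),
        "mean_less_popular": float(scores[~popular].mean()),
        "pearson_r": float(np.corrcoef(scores, views)[0, 1]),
    }


# Made-up numbers, used only to show how the function is called.
toy_scores = [0.2, 0.4, 0.6, 0.8]
toy_views = [1.0, 2.0, 3.0, 4.0]
result = compare_by_popularity(toy_scores, toy_views)
print(result)
```

On the real data, the scores would come from running the model over every song's lyrics and the view counts from the `views` column of `df`. If all songs had identical view counts, the popular group would be empty and the group mean undefined.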