In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string

In [2]:
columns = ['Artist', 'Title', 'Album', '#_of_views', 'Release_date', '#_feat_artists', 'Artist_pop', 'Genre', 
           'Followers', 'Danceability', 'Energy', 'Valence', 'Duration', 'Loudness', '#_words', 'Lyrics']
data = pd.read_csv('lyricDataset.scv', names = columns)

The Spotify data was joined to the existing Genius dataset by matching each track to the Spotify entry with the most similar title, so the dataset now has additional features from Spotify. Some of the feature descriptions are taken straight from the Spotify API documentation. The full list of features is as follows:<br>

1. 'Artist' - Song artist<br>
DESCRIPTION: Name of artist
2. 'Title' - Song title<br>
DESCRIPTION: Name of song
3. 'Album' - Album Title<br>
DESCRIPTION: Name of album the song is from.
4. '#_of_views' - Number of pageviews for the lyric<br>
DESCRIPTION: Number of views the lyric page received on the Genius website.
5. 'Release_date' - Release date of song<br>
DESCRIPTION: The date the song was released
6. '#_feat_artists' - How many featured artists<br>
DESCRIPTION: The number of artists featured on the song
7. 'Artist_pop' - Artist Popularity<br>
DESCRIPTION: How popular an artist is on Spotify. Range is 0-100.
8. 'Genre' - Genre(s)<br>
DESCRIPTION: Genre(s) the artist is classified under, separated by ' / '.
9. 'Followers'<br>
DESCRIPTION: The number of users following the artist on Spotify
10. 'Danceability'<br>
DESCRIPTION: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. (Source: Spotify)
11. 'Energy'<br>
DESCRIPTION: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
12. 'Valence'<br>
DESCRIPTION: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
13. 'Duration' - Duration (in Milliseconds)<br>
DESCRIPTION: The length of the song in milliseconds
14. 'Loudness'<br>
DESCRIPTION: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
15. '#_words' - Total words in lyrics<br>
DESCRIPTION: How many words each lyric has.
16. 'Lyrics'<br>
DESCRIPTION: The lyrics for the specific song.
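
The title-based matching mentioned before this list can be sketched roughly as follows. This is only an illustration using Python's standard `difflib`; the notebook does not say which similarity measure was actually used, and the helper name `best_title_match`, the sample titles, and the 0.6 cutoff are all assumptions.

```python
from difflib import SequenceMatcher

def best_title_match(genius_title, spotify_titles, cutoff=0.6):
    """Return the Spotify title most similar to genius_title, or None.

    Hypothetical helper: the similarity measure and the cutoff are
    assumptions, not the method used to build the dataset.
    """
    best_title, best_score = None, cutoff
    target = genius_title.lower().strip()
    for candidate in spotify_titles:
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings
        score = SequenceMatcher(None, target, candidate.lower().strip()).ratio()
        if score >= best_score:
            best_title, best_score = candidate, score
    return best_title

# Made-up titles for illustration only
spotify_titles = ["Shape of You", "Perfect", "Castle on the Hill"]
print(best_title_match("Shape Of You (Remix)", spotify_titles))
print(best_title_match("Completely Different Song", spotify_titles))
```

A cutoff guards against joining a Genius track to an unrelated Spotify track when no close title exists; unmatched tracks would then need to be dropped or handled by hand.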

In [3]:
data.shape

(1580, 16)

In [4]:
data.isnull().sum()

Artist             0
Title              0
Album             21
#_of_views         0
Release_date      87
#_feat_artists     0
Artist_pop         0
Genre             20
Followers          0
Danceability       0
Energy             0
Valence            0
Duration           0
Loudness           0
#_words            0
Lyrics             1
dtype: int64

In [5]:
data = data.dropna()

In [6]:
data.shape

(1453, 16)
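
Note that 1580 - 1453 = 127 rows were dropped, while the per-column counts above add up to 129 missing values, so some rows must be missing more than one field. A small sketch on a made-up toy frame (not the real dataset) shows how the two numbers can differ:

```python
import pandas as pd

# Toy frame (made-up values) where one row is missing two fields at once
toy = pd.DataFrame({
    'Album': ['A', None, 'C', None],
    'Genre': ['pop', None, None, 'rock'],
})

# Total number of missing cells across all columns
missing_cells = int(toy.isnull().sum().sum())
# Number of rows with at least one missing cell, i.e. the rows dropna() removes
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_cells, rows_with_missing, len(toy.dropna()))
```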

<br><br>
Christian, below is the code for calculating lyric sentiment. You can use it once you are done cleaning the dataset (handling missing values, dropping rows, etc.).<br><br>
The next three cells convert all 'Lyrics' to strings, strip punctuation from them, and calculate a sentiment score for each lyric. sent_result will hold all the sentiment coefficients.<br>
The last cell adds sent_result to the dataframe as a new column, 'Sent'.

In [4]:
data['Lyrics'] = data['Lyrics'].astype(str)

In [5]:
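str_array = data['Lyrics'].to_numpy()
length = len(str_array)
# str.translate needs a translation table; this one deletes all ASCII punctuation
punct_table = str.maketrans('', '', string.punctuation)
for i in range(length):
    # lowercase, turn '+' and line breaks into spaces, then strip punctuation
    str_array[i] = str_array[i].lower().replace('+', ' ').replace('\n', ' ').translate(punct_table)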
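
For reference, `str.translate` expects a translation table rather than a plain string of characters. A minimal standalone sketch of stripping punctuation with a table built by `str.maketrans` (the sample lyric and the helper name `clean_lyric` are made up for illustration):

```python
import string

# Table that maps every ASCII punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_lyric(text):
    """Lowercase, turn line breaks into spaces, and drop ASCII punctuation."""
    return text.lower().replace('\n', ' ').translate(PUNCT_TABLE)

sample = 'Hello, World!\nIt\'s a "test"...'
print(clean_lyric(sample))
```

Keep in mind that this also removes apostrophes, so contractions such as "it's" become "its".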

In [6]:
sent_result = np.empty(length)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#note: depending on how you installed (e.g., using source code download versus pip install), you may need to import like this:
#from vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
for i in range(length):
    sent_result[i] = analyzer.polarity_scores(str_array[i])['compound']

In [8]:
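# Reuse data's index: after dropna() it is no longer 0..n-1, so a default-indexed Series would misalign
tempo = pd.Series(sent_result, index=data.index)
data = data.assign(Sent = tempo)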
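
One thing to watch when attaching scores to a dataframe that has had rows dropped: pandas aligns a Series on index labels, not on position. The toy example below (made-up lyrics and scores, not the real data) sketches the difference:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; dropna() leaves index labels 0 and 2
df = pd.DataFrame({'Lyrics': ['la la', None, 'na na']}).dropna()
scores = np.array([0.5, -0.2])  # made-up scores, one per remaining row

# A bare Series gets a fresh 0..n-1 index, and assign() aligns on labels,
# so label 2 finds no matching score and label 1's score is discarded
misaligned = df.assign(Sent=pd.Series(scores))
# Passing the frame's own index keeps each score on its own row
aligned = df.assign(Sent=pd.Series(scores, index=df.index))

print(misaligned)
print(aligned)
```

Passing the raw NumPy array to `assign` would also work, since arrays are attached by position.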