## Dataset Description 

https://github.com/walkerkq/musiclyrics

Billboard has published a Year-End Hot 100 every December since 1958. The chart measures the performance of singles in the U.S. throughout the year. Using R, I’ve combined the lyrics from 50 years of Billboard Year-End Hot 100 (1965-2015) into one dataset for analysis. You can download that dataset here.

The songs used for analysis were scraped from Wikipedia’s entry for each Billboard Year-End Hot 100 Songs chart (e.g., 2014). This is the year-end chart, not the weekly rankings: many artists have made a weekly chart without reaching the year-end chart. The year-end chart is calculated using an inverse point system based on the weekly Billboard charts (100 points for a week at number one, 1 point for a week at number 100, etc.).
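
The inverse point system can be illustrated with a small sketch. It only implements the simplified rule described above (101 minus the weekly position, summed over weeks); the input format and the function name `year_end_points` are assumptions for illustration, and Billboard's actual methodology may differ.

```python
from collections import defaultdict

def year_end_points(weekly_charts):
    """Sum inverse-rank points over all weekly charts.

    weekly_charts: iterable of lists, each holding the song titles of one
    week ordered from position 1 downward (hypothetical input format).
    """
    points = defaultdict(int)
    for chart in weekly_charts:
        for position, song in enumerate(chart, start=1):
            # 100 points for position 1, 1 point for position 100
            points[song] += 101 - position
    return dict(points)

# Three toy weeks with three songs each
weeks = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
totals = year_end_points(weeks)
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)   # {'A': 297, 'B': 299, 'C': 295}
print(ranking)  # ['B', 'A', 'C']
```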

I used the XML and RCurl packages to scrape song and artist names from each Wikipedia entry. I then used that list to scrape lyrics from sites that had predictable URL strings (for example, metrolyrics.com uses metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html). If the scrape from the first site failed, I moved on to the second, and so on. About 78.9% of the lyrics were scraped from metrolyrics.com, 15.7% from songlyrics.com, and 1.8% from lyricsmode.com. About 3.6% (187/5100) were unavailable.
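
The original scraping was done in R; the Python sketch below only illustrates the two ideas in the paragraph above: building a URL from a predictable pattern and falling back to the next site when one fails. The slug rules in `slugify`, the helper names, and the stub fetchers are assumptions, and no network request is made.

```python
import re

def slugify(text):
    """Lowercase the text and join its alphanumeric runs with hyphens (assumed convention)."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))

def metrolyrics_url(song, artist):
    # Pattern quoted in the description: metrolyrics.com/SONG-NAME-lyrics-ARTIST-NAME.html
    return f"metrolyrics.com/{slugify(song)}-lyrics-{slugify(artist)}.html"

def first_successful(song, artist, scrapers):
    """Try each (source, fetch) pair in order and return the first non-empty result."""
    for source, fetch in scrapers:
        try:
            lyrics = fetch(song, artist)
        except Exception:
            lyrics = None  # treat any scraping error as a miss and move on
        if lyrics:
            return lyrics, source
    return None, None

# Stub fetchers standing in for real HTTP scrapers
def failing_fetch(song, artist):
    raise IOError("page not found")

def working_fetch(song, artist):
    return "la la la"

url = metrolyrics_url("Wooly Bully", "Sam the Sham and the Pharaohs")
result = first_successful("Wooly Bully", "Sam the Sham and the Pharaohs",
                          [("metrolyrics.com", failing_fetch),
                           ("songlyrics.com", working_fetch)])
print(url)
print(result)
```

Recording the source name alongside the lyrics is what makes the `source` column of the dataset possible.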

The dataset features 5100 observations with the features rank (1-100), song, artist, year, lyrics, and source. The artist feature is fairly standardized thanks to Wikipedia, but there is still quite a bit of noise when it comes to artist collaborations (Justin Timberlake featuring Timbaland, for example). If there were any errors in the lyrics that were scraped, such as spelling errors or derivatives like "nite" instead of "night," they haven't been corrected.
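
One possible way to reduce the collaboration noise is to keep only the credit that precedes a collaboration marker. This is a minimal sketch, not part of the original dataset preparation; the marker list and the helper name `primary_artist` are assumptions and would need extending for the real data.

```python
import re

# Hypothetical list of collaboration markers; extend as needed for the data
_COLLAB = re.compile(r"\s+(?:featuring|feat\.?|ft\.?)\s+", flags=re.IGNORECASE)

def primary_artist(artist):
    """Return the part of the credit before the first collaboration marker."""
    return _COLLAB.split(artist, maxsplit=1)[0].strip()

print(primary_artist("Justin Timberlake featuring Timbaland"))  # Justin Timberlake
print(primary_artist("The Beatles"))                            # The Beatles
```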

Full analysis can be found here.

# Imports and Data Loading

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import pickle

from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

from Levenshtein import distance as levenshtein_distance

import matplotlib.pyplot as plt

### Lyrics Top-100 dataset

In [None]:
data_file_incomplete = "datasets/billboard_lyrics_1964-2015.csv"
data_file = "datasets/billboard_full.csv"

df_incomplete = pd.read_csv(data_file_incomplete, encoding = "cp1252") # utf-8 fails on this file; "ANSI" is only a codec alias on Windows, so cp1252 is assumed here as the portable equivalent
df = pd.read_csv(data_file, index_col=0, header=0, sep=",") 

df.head()

In [None]:
len(df.Artist.unique())

## Number of years in Top 100 per Song

In [None]:
df_count = df.groupby(["Artist", "Song"]).Year.agg(list).to_frame()
df_count["Count"] = df_count.Year.apply(len)
df_count = df_count.sort_values("Count", ascending = False)
df_count.head()

In [None]:
counts = df_count["Count"].value_counts()
plt.pie(counts, labels = counts.index, autopct='%1.1f%%') # label each wedge with the count it represents
plt.show()

## Genre

In [None]:
vc_genre = df.Genre.value_counts()
vc_genre = vc_genre[vc_genre > 70] # Keep only genres with more than 70 songs

plt.figure(figsize=(10,6))
plt.pie(vc_genre.values, labels = vc_genre.index, autopct='%1.1f%%')
# plt.savefig("images/genre_distribution.png")
plt.show()

## Number of songs in Top 100 per artist (a song that charted in two years is counted twice)

In [None]:
df.Artist.value_counts().describe()

In [None]:
df_songCounts = df.groupby("Artist").Song.count()
df_rndArtist = df_songCounts.to_frame().reset_index().groupby("Song").agg(list)
df_rndArtist["Artist"] = df_rndArtist["Artist"].apply(lambda a : np.random.choice(a, 1)[0])

df_labels = pd.DataFrame(range(1, df_songCounts.max() + 1), columns = ["Song"])
df_labels["Artist"] = ""
df_labels = df_labels.set_index("Song")
df_labels.update(df_rndArtist)

In [None]:
tmp = df_songCounts.value_counts()
tmp[tmp.index <= 3].sum() / tmp.sum()

In [None]:
max_count = df_songCounts.max()
ticks = range(1, max_count + 1)
plt.figure(figsize=(12,6))
plt.hist(df_songCounts, bins = range(1, max_count + 2)) # one bin per count; the extra edge keeps the two largest counts from sharing a bin
plt.xticks(ticks, df_labels["Artist"], rotation='vertical')
plt.yscale("log")
plt.title("Number of Top-100 appearances per artist, with a randomly selected artist labelling each bin")
# plt.savefig("images/songs_per_artist.png", bbox_inches='tight')
plt.show()

## Lyrics statistics

In [None]:
lyrics_lengths = df.Lyrics.fillna("").apply(lambda s : len(s.split())) # split() ignores repeated whitespace; missing lyrics count as 0 words

bins = range(1, 1000)
plt.hist(lyrics_lengths, bins = bins)
plt.title("Number of words per song")
plt.savefig("words.png")
plt.show()

In [None]:
lyrics_lengths.describe()

In [None]:
lyrics_lengths_unique = df.Lyrics.fillna("").apply(lambda s : len(set(s.split()))) # distinct whitespace-separated tokens per song

bins = range(1, 1000)
plt.hist(lyrics_lengths_unique, bins = bins)
plt.title("Number of unique words per song")
plt.savefig("unique_words.png")
plt.show()

In [None]:
lyrics_lengths_unique.describe()

### Clustering of lyrics

In [None]:
from nltk import pos_tag, sent_tokenize, wordpunct_tokenize
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english')) 

def lemmatize(token, treebank_tag):
    # Map the first letter of the Penn Treebank tag to a WordNet POS; default to noun
    wn_pos = {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(treebank_tag[0], wn.NOUN)
    return lemmatizer.lemmatize(token, wn_pos)

def preprocess_lyrics(lyrics):
    # Tokenize, POS-tag, drop stop words, lemmatize, and join back into one string
    tagged_tokens = pos_tag(wordpunct_tokenize(lyrics))
    preprocessed = [lemmatize(token, tag) for (token, tag) in tagged_tokens if token.lower() not in stop_words]
    return " ".join(preprocessed)

In [None]:
preprocessed_lyrics = df.Lyrics.fillna("").apply(preprocess_lyrics)

In [None]:
# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(stop_words='english', preprocessor = None)
# Fit and transform the preprocessed lyrics
count_data = count_vectorizer.fit_transform(preprocessed_lyrics)

In [None]:
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older releases used get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        
# Tweak the two parameters below
number_topics = 5
number_words = 4
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)

In [None]:
vectors_10 = lda.transform(count_data) # document-topic matrix; it has number_topics columns (5 here) despite the name

In [None]:
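# Weights of the first two topics. Red: the first 100 rows; blue: rows from index 5000 onward
# (presumably the earliest and the latest chart years)
plt.scatter(vectors_10[:100, 0], vectors_10[:100, 1], color = "red")
plt.scatter(vectors_10[5000:, 0], vectors_10[5000:, 1], color = "blue")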

plt.show()

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(2)
vectors_2 = svd.fit_transform(vectors_10)

In [None]:
plt.scatter(vectors_2[:100, 0], vectors_2[:100, 1], color = "red")
plt.scatter(vectors_2[5000:, 0], vectors_2[5000:, 1], color = "blue")

plt.show()

## Using Doc2Vec

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc.split(" "), [i]) for i, doc in enumerate(df.Lyrics.fillna(""))]
model = Doc2Vec(documents, vector_size=300, window=4, min_count=1, workers=4, epochs = 10, dbow_words = 1)

In [None]:
sentence = "I cant get no satisfaction"
vector = model.infer_vector(sentence.split(" "))
documents[model.docvecs.most_similar([vector])[1][0]]

In [None]:
documents[4480]

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(2)
vectors_2 = svd.fit_transform(model.docvecs.vectors_docs)

In [None]:
plt.scatter(vectors_2[:100, 0], vectors_2[:100, 1], color = "red")
plt.scatter(vectors_2[5000:, 0], vectors_2[5000:, 1], color = "blue")

plt.show()

## Events

In [None]:
import pandas as pd
import spacy
from spacy import displacy

from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
tqdm_notebook.pandas()  # enables DataFrame.progress_apply

nlp = spacy.load("en_core_web_sm")

In [None]:
event_filepath = "datasets/events_full.csv"

df_events = pd.read_csv(event_filepath, index_col=0)
df_events.xs(0).Content

In [None]:
df_events.head()

In [None]:
def entity_extractor(cols, song = False):
    """
    Build a row-wise extractor that runs spaCy NER over the given text columns.
    If song is True, the row's artist and song title are appended as entities.
    """
    ignore_entities = ["CARDINAL", "MONEY", "ORDINAL", "QUANTITY", "TIME"]
    def extract_entities(row):
        """
        Return the (text, label) pairs found in one row, skipping numeric entity types.
        """
        entities = []
        for col in cols:
            if isinstance(row[col], str):  # skip NaN / missing text
                entities += [(ent.text, ent.label_) for ent in nlp(row[col]).ents if ent.label_ not in ignore_entities]

        if song:
            entities.append((row["Artist"], "PERSON"))
            entities.append((row["Song"], "WORK_OF_ART"))

        return entities

    return extract_entities

In [None]:
extraction_cols = ["Content", "Summary"]
df_events["Entities"] = df_events.progress_apply(entity_extractor(extraction_cols), axis = 1)

In [None]:
df["Entities"] = df.progress_apply(entity_extractor(["Lyrics"]), axis = 1)

In [None]:
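def find_refs(song_row):
    # Collect the index of every event that shares an entity with the song.
    # Two entities match when one string contains the other (case-insensitive).
    refs = []
    for entity, label in song_row.Entities:
        entity = entity.lower()
        for i, row in df_events.Entities.items():  # Series.iteritems() was removed in pandas 2.0
            ents_lower = [ent.lower() for ent, lab in row]
            if any(entity in low_ent or low_ent in entity for low_ent in ents_lower):
                refs.append(i)
    return refs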
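
The substring rule used for matching can be tried on toy data. The sketch below is a standalone re-statement of that rule under the assumption that entities are `(text, label)` pairs; `matching_events` and the sample entities are made up for illustration. It also shows why the rule is permissive: a short entity such as "us" matches any longer entity containing those letters.

```python
def matching_events(song_entities, events_entities):
    """Return the positions of events that share an entity with the song.

    Two entities match when one lowercased string contains the other.
    """
    refs = set()
    song_lower = [text.lower() for text, _label in song_entities]
    for i, event_entities in enumerate(events_entities):
        event_lower = [text.lower() for text, _label in event_entities]
        if any(s in e or e in s for s in song_lower for e in event_lower):
            refs.add(i)
    return refs

events = [
    [("Vietnam War", "EVENT")],
    [("Apollo 11", "EVENT")],
    [("South Vietnam", "GPE"), ("Russia", "GPE")],
]

print(matching_events([("vietnam", "GPE")], events))  # {0, 2}
print(matching_events([("us", "GPE")], events))       # {2}: "us" is a substring of "russia"
```

False positives of the second kind are one reason to filter the raw references afterwards.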

In [None]:
"""
Add references to events in songs
"""
df["Refs"] = df.progress_apply(find_refs, axis = 1)
df["Refs"] = df["Refs"].apply(set)

In [None]:
df.filteredRefs

In [78]:
"""
Add references to songs in events
"""
df_events["filteredRefs"] = [[] for i in range(len(df_events))]
for i_song, refs in tqdm_notebook(df["filteredRefs"].items()):  # Series.iteritems() was removed in pandas 2.0
    for i_event in refs:
        df_events.iloc[i_event]["filteredRefs"].append(i_song)
        
df_events["filteredRefs"] = df_events["filteredRefs"].apply(set)


In [None]:
"""
Add references to songs in events
"""
df_events["Refs"] = [[] for i in range(len(df_events))]
for i_song, refs in tqdm_notebook(df["Refs"].items()):  # Series.iteritems() was removed in pandas 2.0
    for i_event in refs:
        df_events.iloc[i_event]["Refs"].append(i_song)
        
df_events["Refs"] = df_events["Refs"].apply(set)

In [None]:
df["Refs"].apply(len)

In [None]:
df.xs(4971).Lyrics

In [None]:
s = dict()
for ents in df.Entities.apply(lambda x : [e[0] for e in x]).values:
    for ent in ents :
        if ent in s:
            s[ent] +=1
        else :
            s[ent] = 1

In [None]:
s = {k: v for k, v in sorted(s.items(), key=lambda item: item[1], reverse=True)}

In [None]:
df_events[df_events.Wikipedia.isna()]

In [None]:
df_events[(df_events.Wikipedia.isna()) & df_events.Content.str.contains("Anniversary")].Content

In [None]:
df_events[(df_events.Year == 2001)& (df_events.Month == "September")]

In [None]:
df_events.xs(811).Entities

In [None]:
len(df_events.xs(811).Refs)

In [None]:
df [(df.Year > 2001) & (df.Lyrics.str.contains("twin"))]

In [None]:
"""
List all the entity types recognized
"""

ent_types = set()
for s in df["Entities"].apply(lambda x : set([e[1] for e in x])):
    for ent in s:
        ent_types.add(ent)
ent_types

In [None]:
df.to_csv("songs_with_refs.csv")
df_events.to_csv("events_with_refs.csv")

In [84]:
with open("datasets/songs_with_refs.pickle", "wb") as f :
    pickle.dump(df, f, pickle.HIGHEST_PROTOCOL)
    
with open("datasets/events_with_refs.pickle", "wb") as f :
    pickle.dump(df_events, f, pickle.HIGHEST_PROTOCOL)

# Load refs

In [2]:
with open("datasets/songs_with_refs.pickle", "rb") as f :
    df = pickle.load(f)
    
with open("datasets/events_with_refs.pickle", "rb") as f :
    df_events = pickle.load(f)

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

all_docs = np.concatenate((df.Lyrics.fillna(""), df_events.Content.fillna("")))

documents = [TaggedDocument(doc.split(" "), [i]) for i, doc in enumerate(all_docs)]
model = Doc2Vec(documents, vector_size=300, window=4, min_count=1, workers=4, epochs = 10, dbow_words = 1)

In [None]:
lyrics2events_sim = []
for i in tqdm_notebook(range(len(df))):
    sims = model.docvecs.most_similar(positive = [model.docvecs[i]], topn=100, clip_start=len(df))[1:]
    sims = [x for x in sims if (x[0] - len(df)) in df.xs(i).Refs]
    sims_refs = sorted(sims, key = lambda x : x[1], reverse=True)[:10]
    lyrics2events_sim.append(sims_refs)
df["Similar"] = lyrics2events_sim

In [None]:
df[df.Similar.apply(len) > 0].head()

## BERT

In [3]:
from transformers import AlbertModel, AlbertTokenizer
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import torch

tokenizer = AlbertTokenizer.from_pretrained('albert-large-v2')
model = AlbertModel.from_pretrained('albert-large-v2')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple


In [4]:
# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [5]:
df["word_count"] = df["Lyrics"].apply(lambda x : len(x.split(" ")))

In [6]:
lyrics_tokenized = torch.Tensor(tokenizer.batch_encode_plus(df.Lyrics,
                                                            max_length =512,
                                                           pad_to_max_length=True,
                                                           padding_side = "right",
                                                           add_special_tokens=True)["input_ids"]).long()

In [11]:
lyrics_vectorized = []
batch_size = 8
with torch.no_grad():
    for i in tqdm_notebook(range(0, len(df), batch_size)):
        batch = lyrics_tokenized[i: min(i + batch_size, len(df))].to(device)
        lyrics_vectorized.append(model(batch)[1].tolist())
        del batch
        torch.cuda.empty_cache()





In [45]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

In [21]:
vects1 = df["vect"][:5].tolist()
vects2 = df_events["vect"][:2].tolist()

In [27]:
sim = cosine_similarity(df["vect"].tolist(), df_events["vect"].tolist())

In [46]:
sim2 = euclidean_distances(df["vect"].tolist(), df_events["vect"].tolist())

In [71]:
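def filterRefs(row):
    # Keep only the referenced events whose similarity to the song exceeds 0.94.
    # Assumes df["sims"] holds, for each song, its similarities to all events
    # (e.g. the matching row of `sim`); that assignment is not shown in this notebook.
    filteredRefs = []
    for ref in row["Refs"]:
        if row["sims"][ref] > 0.94:
            filteredRefs.append(ref)
    return filteredRefs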

df["filteredRefs"] = df.apply(filterRefs, axis = 1)
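
The thresholding step can be checked on toy vectors. This is a minimal sketch using NumPy only; `cosine_sim_matrix` and `filter_refs` are hypothetical helpers that mirror `cosine_similarity` and `filterRefs`, and the 0.94 threshold is taken from the cell above.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def filter_refs(refs, sims_row, threshold=0.94):
    """Keep the referenced event positions whose similarity exceeds the threshold."""
    return [ref for ref in sorted(refs) if sims_row[ref] > threshold]

song_vects = [[1.0, 0.0], [1.0, 1.0]]
event_vects = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
sims = cosine_sim_matrix(song_vects, event_vects)

# Both songs reference all three events before filtering
kept_first = filter_refs({0, 1, 2}, sims[0])
kept_second = filter_refs({0, 1, 2}, sims[1])
print(kept_first)   # [0]
print(kept_second)  # [2]
```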

In [73]:
print(df["filteredRefs"].apply(len).max())
print(df["filteredRefs"].apply(len).mean())

577
10.458577569363033


In [76]:
df["filteredRefs"].apply(len).value_counts()

0      3173
1       338
2       252
3       130
4       119
       ... 
167       1
363       1
106       1
122       1
455       1
Name: filteredRefs, Length: 201, dtype: int64

In [83]:
print(df_events["filteredRefs"].apply(len).max())
print(df_events["filteredRefs"].apply(len).mean())
print(df_events["Refs"].apply(len).max())
print(df_events["Refs"].apply(len).mean())

479
48.00627802690583
1557
310.247533632287


In [82]:
print(df["filteredRefs"].apply(len).max())
print(df["filteredRefs"].apply(len).mean())
print(df["Refs"].apply(len).max())
print(df["Refs"].apply(len).mean())

577
10.458577569363033
1083
67.59007424775302


In [13]:
lyrics_vectors = []
for batch_lyrics in lyrics_vectorized:
    for lyric in batch_lyrics:
        lyrics_vectors.append(lyric)
        
df["vect"] = lyrics_vectors

In [None]:
df_events["text_vect"] = df_events.Summary.fillna(df_events.Content)
df_events["word_count"] = df_events["text_vect"].apply(lambda x : len(x.split(" ")))
df_events["text_vect"] = np.where(df_events["word_count"] > 320, df_events.Content, df_events["text_vect"])
df_events["word_count"] = df_events["text_vect"].apply(lambda x : len(x.split(" ")))

In [None]:
events_tokenized = torch.Tensor(tokenizer.batch_encode_plus(df_events.text_vect,
                                                           pad_to_max_length=True,
                                                           padding_side = "right",
                                                           add_special_tokens=True)["input_ids"]).long()

In [None]:
with open("events_vects.pickle", "rb") as f:
    tmp = pickle.load(f)

In [None]:
with open("events_vects.pickle", "wb") as f:
    pickle.dump(events_vectorized, f, pickle.HIGHEST_PROTOCOL)

In [None]:
events_vectorized = []
with torch.no_grad():
    for text in tqdm_notebook(df_events.text_vect):
        # Encode each event text as a batch of size 1 and keep the pooled output
        token_vec = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0).to(device)
        events_vectorized.append(model(token_vec)[1])

In [None]:
df[df.Artist.str.contains("dire straits")]

In [None]:
events_vectorized = []
batch_size = 8
with torch.no_grad():
    for i in tqdm_notebook(range(0, len(df_events), batch_size)):
        batch = events_tokenized[i: min(i + batch_size, len(df_events))].to(device)
        events_vectorized.append(model(batch))
        del batch
        
events_vectorized = [x[1] for x in events_vectorized]

In [None]:
events_vectors = []
for batch_event in events_vectorized:
    for event in batch_event:
        events_vectors.append(event.tolist())
        
df_events["vect"] = events_vectors

In [None]:
df_events

In [None]:
df_events.head()

In [None]:
df_events.Refs.head(20)

In [None]:
# df["text_vect"] = df.Summary.fillna(df_events.Content)
# df["word_count"] = df["Lyrics"].apply(lambda x : len(x.split(" ")))
# df["text_vect"] = np.where(df["word_count"] > 400, df_events.Content, df_events["text_vect"])
# df["word_count"] = df_events["text_vect"].apply(lambda x : len(x.split(" ")))

In [None]:
# Incomplete cell: the comprehension is missing its element expression
# tmp = [ for sent in df_events["text_vect"]]
# x = torch.tensor(tmp)

## Export tsv

In [None]:
len(model.docvecs.vectors_docs)

In [None]:
pd.DataFrame(model.docvecs.vectors_docs).to_csv("vec.tsv", index = False, sep = "\t", header=False)

In [None]:
df.to_csv("data.tsv", sep = "\t", index = False)