Step 4 '4_topic_mining': In this step, we build a unigram topic model for each genre from the song lyrics. Lyrics are lowercased and tokenized into alphabetic words of two or more letters. By default, tokens that do not appear in the NLTK 'words' corpus are dropped, and English stopwords are removed for the per-genre counts only. Within each genre, every word count is divided by the genre's total token count, which gives a probability distribution over words. We also count words across all genres, this time keeping stopwords, to form a background model that represents the overall word distribution in the dataset. The topic models and the background model are saved as JSON files.
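
The counting and normalization described above can be sketched on a toy corpus. This is a minimal illustration rather than the notebook's own code: the three songs, the three-word stopword list, and the names toy_songs, toy_stopwords, toy_tokenize and normalize are all assumptions made up for the example. The notebook itself uses NLTK's stopword and word lists.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (genre, lyrics) pairs invented for illustration.
toy_songs = [
    ("rock", "Love is the fire and the fire is love"),
    ("rap", "Got the money, got the flow"),
    ("rock", "The night is long"),
]

# Tiny stand-in stopword list (assumption; the notebook uses NLTK's English list).
toy_stopwords = {"the", "is", "and"}


def toy_tokenize(text):
    """Lowercase the text and keep alphabetic tokens of two or more letters."""
    return re.findall(r"\b[a-zA-Z]{2,}\b", text.lower())


def normalize(counter):
    """Turn raw counts into a probability distribution that sums to 1."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}


genre_counts = defaultdict(Counter)
background_counts = Counter()

for genre, lyrics in toy_songs:
    tokens = toy_tokenize(lyrics)
    # The background model keeps every token, stopwords included.
    background_counts.update(tokens)
    # The per-genre model drops stopwords before counting.
    genre_counts[genre].update(t for t in tokens if t not in toy_stopwords)

toy_topic_models = {genre: normalize(c) for genre, c in genre_counts.items()}
toy_background_model = normalize(background_counts)

print(toy_topic_models["rock"])
print(toy_background_model["the"])
```

Because stopwords are removed only from the per-genre counts, a word such as "the" has a probability in the background model but is absent from every genre model.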

In [8]:
import pandas as pd
import nltk
from nltk.corpus import stopwords, words
from collections import Counter, defaultdict
from tqdm.notebook import tqdm
import re


# Initialize counters
topic_word_counts = defaultdict(Counter)
background_word_count = Counter()

# Load CSV and drop rows with NaN in 'tag' or 'lyrics' columns
df = pd.read_csv('../data/3_ds3_cleaned.csv')
df = df.dropna(subset=['tag', 'lyrics'])

print(df.head(2))

        title   tag    artist  year  views        features  \
0  Revelation  rock  Zardonic  2018   6680              {}   
1  Robitussin    rb   OPENPAD  2017     94  {"Rossi Rock"}   

                                              lyrics       id  
0  Try to do it like this, you won't get it Try t...  3849758  
1  Saucalini:  Baby what you want, what you need?...  3387226  


In [9]:
nltk.download('words')
nltk.download('stopwords')
valid_words = set(words.words())
stop_words = set(stopwords.words('english'))

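# Lowercase the text, extract alphabetic tokens of 2+ letters, then apply the optional filters
def tokenize(text, remove_stopwords=False, filter_non_words=True):
    # Use 'tokens' rather than 'words' so the nltk 'words' corpus import is not shadowed
    tokens = re.findall(r'\b[a-zA-Z]{2,}\b', text.lower())
    if filter_non_words:
        # Keep only tokens found in the NLTK 'words' corpus
        tokens = [token for token in tokens if token in valid_words]
    if remove_stopwords:
        # Reuse the module-level stop_words set instead of rebuilding it on every call
        tokens = [token for token in tokens if token not in stop_words]
    return tokens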

# Count words by topic and overall with a progress bar
for topic, lyrics in tqdm(zip(df['tag'], df['lyrics']), total=len(df), desc="Processing rows"):
    # Tokenize once; the background count keeps stopwords
    background_words = tokenize(lyrics)
    background_word_count.update(background_words)
    # Topic counts use the same tokens with stopwords removed
    topic_words = [word for word in background_words if word not in stop_words]
    topic_word_counts[topic].update(topic_words)

[nltk_data] Downloading package words to /home/jovyan/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Processing rows:   0%|          | 0/67606 [00:00<?, ?it/s]

In [10]:
# Calculate unigram probabilities per topic
topic_models = {}

for topic, word_counts in tqdm(topic_word_counts.items(), desc="Calculating topic models"):
    topic_models[topic] = {}
    total_topic_word_count = sum(word_counts.values())
    for word, count in word_counts.items():
        topic_models[topic][word] = count / total_topic_word_count
        

# Display results

# Loop through each topic and display the top 10 words by probability
# (the loop variable is named word_probs so it does not overwrite the nltk 'words' import)
for topic, word_probs in topic_models.items():
    # Sort words by probability in descending order and keep the top 10
    top_words = sorted(word_probs.items(), key=lambda x: x[1], reverse=True)[:10]
    
    # Print topic and the top 10 words with their values
    print(f"Top 10 words for topic '{topic}':")
    for word, value in top_words:
        print(f"  {word}: {value}")
    print()  # Add a newline for readability between topics

Calculating topic models:   0%|          | 0/6 [00:00<?, ?it/s]

Top 10 words for topic 'rock':
  know: 0.01208772250939899
  like: 0.010470906784314468
  time: 0.008937193494026092
  oh: 0.008756261610672457
  love: 0.00875205389245493
  never: 0.008644757077908006
  one: 0.008274477874765682
  see: 0.007913666037612793
  go: 0.007846342546132371
  got: 0.007027941352823484

Top 10 words for topic 'rb':
  know: 0.022522515811351396
  love: 0.021878887655599342
  like: 0.017795126602552166
  yeah: 0.0166538785300844
  baby: 0.016004290854371682
  oh: 0.014460775184558884
  got: 0.013589195390311311
  get: 0.011135363046506605
  let: 0.010550585150366137
  want: 0.01020195323266711

Top 10 words for topic 'rap':
  like: 0.02547880922448938
  got: 0.016341425641241903
  know: 0.014328830928229331
  get: 0.01379296304402042
  yeah: 0.011957052043168994
  bitch: 0.008511356109293782
  go: 0.007899143411188077
  see: 0.007029539626230333
  back: 0.006771421445912608
  cause: 0.006537297349962445

Top 10 words for topic 'misc':
  one: 0.008196599467922271

In [11]:
# Calculate background model probabilities
total_background_count = sum(background_word_count.values())
background_model = {word: count / total_background_count for word, count in background_word_count.items()}

# Display results
# Sort and print the top 10 words by frequency
top_10_background = sorted(background_model.items(), key=lambda x: x[1], reverse=True)[:10]
for word, frequency in top_10_background:
    print(f"{word}: {frequency:.4f}")

the: 0.0539
you: 0.0305
and: 0.0303
to: 0.0278
of: 0.0222
it: 0.0179
in: 0.0175
that: 0.0151
me: 0.0131
my: 0.0130
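
The top-10 listings above sort every word in a model before slicing. As a hedged alternative, the sketch below selects the top entries with heapq.nlargest from the standard library, which avoids sorting the full vocabulary. The helper name top_k_words and the miniature example_model with its values are hypothetical and exist only for illustration.

```python
import heapq


def top_k_words(model, k=10):
    """Return the k highest-probability (word, probability) pairs, best first."""
    return heapq.nlargest(k, model.items(), key=lambda item: item[1])


# Hypothetical miniature model; the probabilities are made up for the example.
example_model = {"know": 0.4, "love": 0.3, "like": 0.2, "oh": 0.1}

for word, prob in top_k_words(example_model, k=2):
    print(f"{word}: {prob:.4f}")
```

The same helper would work on any of the per-topic dictionaries or on the background model, since each maps a word to a probability.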


In [12]:
import json
import os

topic_models_path = '../data/models/topic_models_nostopwords.json'
background_model_path = '../data/models/background_model.json'

# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(topic_models_path), exist_ok=True)

# Save topic_models as JSON
with open(topic_models_path, 'w', encoding='utf-8') as f:
    json.dump(topic_models, f, indent=4)  # `indent=4` makes it human-readable

# Save background_model as JSON
with open(background_model_path, 'w', encoding='utf-8') as f:
    json.dump(background_model, f, indent=4)
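
A later step would presumably read these files back. The sketch below shows one way to reload a saved model and check that each distribution still sums to 1. It is an assumption-laden illustration: load_model, is_distribution, and the toy model written to a temporary directory are hypothetical names and data, used so the example runs without the real files.

```python
import json
import math
import tempfile
from pathlib import Path


def load_model(path):
    """Read a JSON model file back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def is_distribution(model, tol=1e-9):
    """Check that the probabilities of a word -> probability dict sum to 1."""
    return math.isclose(math.fsum(model.values()), 1.0, abs_tol=tol)


# Hypothetical toy topic model, written to a temporary file for the round trip.
toy_model = {"rock": {"love": 0.5, "fire": 0.5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_topic_models.json"
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy_model, f, indent=4)
    reloaded = load_model(toy_path)

# The reloaded data should match and every per-topic distribution should sum to 1.
roundtrip_ok = reloaded == toy_model and all(
    is_distribution(word_probs) for word_probs in reloaded.values()
)
print(roundtrip_ok)
```

For the real files, the same check could be applied to each topic in the topic models file and directly to the background model, which is a single word-to-probability dictionary.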
