# 02 Data Processing

In [1]:
import pandas as pd
from lyricsgenius import Genius
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import requests
import json
import concurrent.futures
import time
from langdetect import detect

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.lda_model

In [2]:
data = pd.read_csv('data.csv')

Cleaning steps:
- Convert to lowercase
- Remove stopwords
- Tokenize and lemmatize
- Remove punctuation and special characters

### Remove unwanted columns, normalize names

In [3]:
data = data.drop(columns=["Tempo", "Danceability", "Energy", "Valence", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Lyrics", "Unnamed: 0.1", "Unnamed: 0"], errors='ignore')
data.rename(columns={'lyrics': 'Lyrics'}, inplace=True)

In [4]:
data

Unnamed: 0,Song,Artist,Lyrics,Duration (seconds),Artist Followers,language
0,Ella Baila Sola,Eslabon Armado,"Compa, ¿qué le parece esa morra?\nLa que anda ...",165.671,6081474.0,es
1,These Words,Natasha Bedingfield,"My heart\nThese words are my own\n(Uh, yeah, u...",136.580,41112.0,en
2,"Guantanamera (feat. Ms. Lauryn Hill, Celia Cru...",Wyclef Jean,none,270.493,385395.0,it
3,Vanish Into You,Lady Gaga,Saw your face and mine\nIn a picture by our be...,244.915,35745647.0,en
4,Volver,Carmen Ferre,none,149.655,5988.0,it
...,...,...,...,...,...,...
1312,Rescue,Lauren Daigle,You are not hidden\nThere's never been a momen...,215.613,2969606.0,en
1313,Wild,Beach House,My mother said to me that I would get in troub...,219.453,2540586.0,en
1314,You Came To Me,Beach House,Invite your sister\nInto the garden\nAll canno...,245.333,2540586.0,en
1315,Chariot,Beach House,Sunny day\nIn a chariot\nWere they waving back...,316.507,2540586.0,en


## Clean lyrics for topic modeling

Note on lemmatization: many words in lyrics end in "-in" instead of "-ing" (for example "tryin" and "dancin"), and these words will not be lemmatized correctly.
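One possible way to handle the clipped "-in" endings is to restore the final "g" before lemmatizing, but only when the longer form is a known word, so that "begin" or "skin" are left alone. This is a minimal sketch, not what the notebook does: `restore_final_g` and the tiny `vocabulary` set are hypothetical names introduced for illustration, and a real run would need a full English word list.

```python
def restore_final_g(token, vocabulary):
    """Return token + 'g' when that repairs a clipped '-in' ending, else the token unchanged."""
    if token.endswith("in") and token not in vocabulary and token + "g" in vocabulary:
        return token + "g"
    return token

# Hypothetical mini-vocabulary (assumption); replace with a real word list.
vocabulary = {"trying", "dancing", "begin", "skin", "love"}

tokens = ["tryin", "dancin", "begin", "skin", "love", "darlin"]
restored = [restore_final_g(t, vocabulary) for t in tokens]
print(restored)
```

Words whose restored form is not in the vocabulary (here "darlin") pass through unchanged. Note also that `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed, so restoring the ending alone may not be enough to reduce "trying" to "try".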

In [5]:
# Keep only rows that have lyrics and are in English.
# .copy() makes the filtered frame independent of the original, so the
# assignments below do not raise SettingWithCopyWarning.
data = data[(data['Lyrics'] != "none") & (data['language'] == 'en')].copy()

# Make lowercase.
data["Lyrics"] = data["Lyrics"].str.lower()

# Replace newline characters with spaces.
data["Lyrics"] = data["Lyrics"].str.replace("\n", " ", regex=False)

# Remove anything that is not a word character or whitespace.
data["Lyrics"] = data["Lyrics"].str.replace(r"[^\w\s]", "", regex=True)

# Resources needed for tokenizing, removing stopwords, and lemmatizing.
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # built once, not once per song

def preprocess_lyrics(lyrics):
    # Drop digits.
    lyrics = re.sub(r'\d+', '', lyrics)

    tokens = word_tokenize(lyrics)
    tokens = [word for word in tokens if word not in stop_words]

    # Drop words containing non-ASCII characters (some songs are bilingual,
    # which does not work well with lemmatization).
    tokens = [word for word in tokens if not any(ord(char) > 127 for char in word)]

    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(lemmatized_tokens)

# Apply preprocessing to the lyrics column.
data["Lyrics"] = data["Lyrics"].apply(preprocess_lyrics)


In [6]:
data

Unnamed: 0,Song,Artist,Lyrics,Duration (seconds),Artist Followers,language
1,These Words,Natasha Bedingfield,heart word uh yeah uh threw chord together com...,136.580,41112.0,en
3,Vanish Into You,Lady Gaga,saw face mine picture bedside cold summertime ...,244.915,35745647.0,en
5,She's A Lady,Tom Jones,well shes youd ever want shes kind id like fla...,174.146,1113997.0,en
6,Anxiety,Doechii,anxiety keep tryin feel quietly tryin silence ...,249.302,2342579.0,en
15,O My Heart,Mother Mother,oh heart fish water oh heart fish rock bakes b...,211.453,4349191.0,en
...,...,...,...,...,...,...
1312,Rescue,Lauren Daigle,hidden there never moment forgotten hopeless t...,215.613,2969606.0,en
1313,Wild,Beach House,mother said would get trouble father wont come...,219.453,2540586.0,en
1314,You Came To Me,Beach House,invite sister garden play fistful wildflower h...,245.333,2540586.0,en
1315,Chariot,Beach House,sunny day chariot waving back losing touch muc...,316.507,2540586.0,en


*Note: this is probably where the next notebook should have started.*

## TF-IDF
TF-IDF (term frequency-inverse document frequency) is used to build the document-term matrix.
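To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf` helper and `toy_lyrics` are hypothetical names for illustration only, and the sketch splits on whitespace rather than using the vectorizer's tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed IDF.
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy corpus (assumption): three very short "songs".
toy_lyrics = ["love love heart", "love night", "rainbow night"]
rows = tfidf(toy_lyrics)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the third toy song, "rainbow" appears in only one document and so outweighs "night", which appears in two. Terms that occur in almost every song are down-weighted in the same way.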

In [7]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2)

X = vectorizer.fit_transform(data['Lyrics']) 

df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())



In [8]:
df_tfidf

Unnamed: 0,aah,abandon,able,absence,absolutely,absolution,abuse,acceptance,accepted,ache,...,york,youd,youll,young,younger,youre,youth,youve,zone,zoom
0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.023891,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.040854,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.033001,0.0,0.0,0.0,...,0.0,0.019717,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
897,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.027064,0.0,0.000000,0.0,0.0
898,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.106620,0.0,0.0
899,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
900,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0


## LDA

In [9]:
n_topics = 3
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_model.fit(X)

n_top_words = 10 # num words to show per topic
terms = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(lda_model.components_):
    print(f"Topic #{topic_idx + 1}:")
    print(" ".join([terms[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# assigning topics to each song
topic_assignments = lda_model.transform(X).argmax(axis=1)
data['Topic'] = topic_assignments


Topic #1:
love ahh perfect everlasting ahah hoo oohoohooh nah dumb da

Topic #2:
dancin guerrilla warned survive og envious killed belongs rainbow release

Topic #3:
im dont love know oh like na got youre want





## Interpretation

**Topic #0** (printed above as Topic #1; numbered from 0 here to match the `Topic` column)

*love, ahh, perfect, everlasting, ahah, hoo, oohoohooh, nah, dumb, da*

This topic is strongly focused on love and relationships. Words like "love," "perfect," and "everlasting" suggest idealized or eternal love. The vocalizations ("ahh," "hoo," "oohoohooh") are common in sung lyrics and could signal expressions of joy or passion. "Nah" and "dumb" introduce a touch of negativity or conflict, but still within a romantic or emotional context. This topic likely represents songs about idealized, passionate, and possibly conflicted love.


**Topic #1** (printed above as Topic #2)

*dancin, guerrilla, warned, survive, og, envious, killed, belongs, rainbow, release*

This topic seems to capture themes of struggle, defiance, and survival, possibly with a sense of rebellion or empowerment. Words like "dancin" and "rainbow" could reflect freedom or joy, while "guerrilla," "warned," "killed," and "survive" evoke ideas of resilience, danger, or social struggle. "OG" (often slang for "original gangster") might suggest themes of authenticity, resistance, or street culture. "Envious" and "release" could indicate emotional release after facing challenges. This topic likely represents songs with themes of struggle, survival, and rebellion.


**Topic #2** (printed above as Topic #3)

*im, dont, love, know, oh, like, na, got, youre, want*

This topic seems to represent emotional uncertainty or self-doubt in the context of love or relationships. Words like "dont," "know," and "want" suggest confusion or indecision, and the presence of "love" and "like" shows that these lyrics likely revolve around personal relationships and emotional conflict. "Oh" and "na" are informal vocalizations, often used in lyrics to express exasperation, frustration, or emotion. This topic likely represents songs dealing with uncertainty, confusion, or questioning in relationships.
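Several of the top terms here ("im", "dont", "youre") are contractions with the apostrophe stripped. In the cleaning cell, punctuation is removed before stopwords are filtered, so a stopword list that stores contractions with apostrophes can no longer match them. The sketch below illustrates the effect of that ordering. It uses a small hypothetical `stop_words` set and plain whitespace splitting as stand-ins (both assumptions), not NLTK's actual list or `word_tokenize`.

```python
import re

# Hypothetical stand-in for a stopword list that keeps apostrophes (assumption).
stop_words = {"i", "you", "don't", "you're", "i'm"}

def strip_punctuation(text):
    # Same pattern as the cleaning cell: drop anything that is not a word char or whitespace.
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text):
    return " ".join(word for word in text.split() if word not in stop_words)

line = "i'm sure you're the one i don't know"

# Order used in the cleaning cell: punctuation first, then stopwords.
punct_first = drop_stop_words(strip_punctuation(line))
# Alternative order: stopwords first, then punctuation.
stops_first = strip_punctuation(drop_stop_words(line))

print(punct_first)
print(stops_first)
```

If this is what happens on the real data, filtering stopwords before stripping punctuation, or adding the apostrophe-free forms to the stopword list, might keep these function words out of the topics.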

In [18]:
data[data['Topic']==2]

Unnamed: 0,Song,Artist,Lyrics,Duration (seconds),Artist Followers,language,Topic
1,These Words,Natasha Bedingfield,heart word uh yeah uh threw chord together com...,136.580,41112.0,en,2
3,Vanish Into You,Lady Gaga,saw face mine picture bedside cold summertime ...,244.915,35745647.0,en,2
5,She's A Lady,Tom Jones,well shes youd ever want shes kind id like fla...,174.146,1113997.0,en,2
6,Anxiety,Doechii,anxiety keep tryin feel quietly tryin silence ...,249.302,2342579.0,en,2
15,O My Heart,Mother Mother,oh heart fish water oh heart fish rock bakes b...,211.453,4349191.0,en,2
...,...,...,...,...,...,...,...
1312,Rescue,Lauren Daigle,hidden there never moment forgotten hopeless t...,215.613,2969606.0,en,2
1313,Wild,Beach House,mother said would get trouble father wont come...,219.453,2540586.0,en,2
1314,You Came To Me,Beach House,invite sister garden play fistful wildflower h...,245.333,2540586.0,en,2
1315,Chariot,Beach House,sunny day chariot waving back losing touch muc...,316.507,2540586.0,en,2


## LDA Visualizations

In [11]:
import pyLDAvis.lda_model
# Prepare the visualization
vis = pyLDAvis.lda_model.prepare(lda_model, X, vectorizer)

# Display the visualization
pyLDAvis.display(vis)


The LDA model seems to be struggling to distinguish distinct topics, which can happen if:  

1. **The lyrics share too many common words** (e.g., love, heart, night), leading LDA to treat them as belonging to the same topic.  
2. **The number of topics (K) is too high or too low.** If too high, some topics may remain empty; if too low, unrelated lyrics may be forced into the same topic.  
3. **The preprocessing needs tuning.** Stopword removal and lemmatization are already applied above; extending the stopword list or adding bigrams/trigrams may help uncover meaningful differences.  
4. **The dataset is not diverse enough.** If the songs come from similar genres or artists, they might truly belong to one dominant topic.  

### Steps to improve the LDA results
- **Check topic coherence.** Use coherence scores (`gensim.models.CoherenceModel`) to find a good number of topics.
- **Adjust the `alpha` and `beta` hyperparameters** (`doc_topic_prior` and `topic_word_prior` in scikit-learn's `LatentDirichletAllocation`). Lower `alpha` makes documents more focused on fewer topics, and lower `beta` makes topics more distinct.
- **Increase the topic count gradually.** With too few topics (only 3 here), LDA might lump everything together. Try increasing it in small steps.
- **Compare TF-IDF with bag-of-words.** This notebook already uses TF-IDF, which reduces the influence of frequent words that don't define a topic well. LDA is formulated over word counts, so comparing against plain counts (`CountVectorizer`) is a reasonable check.
- **Visualize with pyLDAvis.** The plot above helps show how topics overlap and whether some need better separation.
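To show what a coherence score measures, here is a hand-rolled UMass-style coherence in pure Python: for each pair of a topic's top words, it takes `log((D(wi, wj) + 1) / D(wj))`, where `D` counts the documents containing the given words, and averages over pairs. This is only an approximation for illustration. `umass_coherence` and the toy `docs` are hypothetical, and the numbers will not match `gensim.models.CoherenceModel` exactly.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Average log((D(wi, wj) + 1) / D(wj)) over ranked pairs of top words."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # combinations() keeps ranking order: wj is ranked above wi.
    for wj, wi in combinations(top_words, 2):
        d_wj = doc_freq(wj)
        if d_wj == 0:
            continue  # word absent from the reference corpus; skip the pair
        scores.append(math.log((doc_freq(wi, wj) + 1) / d_wj))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy tokenized corpus (assumption).
docs = [
    ["love", "heart", "night"],
    ["love", "heart"],
    ["rainbow", "survive"],
    ["love", "night"],
]

coherent = umass_coherence(["love", "heart"], docs)   # words that co-occur
mixed = umass_coherence(["love", "rainbow"], docs)    # words that never co-occur
print(coherent, mixed)
```

Scores closer to zero mean the top words tend to appear in the same songs. Computing such a score for each candidate topic count is one way to compare settings of `n_topics`.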

