# **Final Project**
## *DATA 5420/6420*
## Name: Dallin Moore

The purpose of the final project is to produce an MVP that is a culmination of the skills you have learned in each of the previous units. This MVP should be a cohesive product in that it combines methods in some logical pipeline; it should NOT simply be a collection of methods implemented independently/separately with no clear end goal/state. You will be tasked with applying at least four methods from across the four units, which I've outlined below:

### Unit 1

* Chatbots
* Basic Text Statistics
* NLP Pipeline (Preprocessing & Normalization)
* Compiling Corpora via APIs

### Unit 2

* Bag of Words Models (TF-IDF and Count Vectorization)
* Document Classification
* Sentiment Analysis

### Unit 3

* Document Summarization
* Topic Modeling
* Text Similarity
> * Information Retrieval (Search)
> * Recommendation Systems
* Document Clustering
> * KMeans
> * Affinity Prop
> * Wards Agglomerative Hierarchical

### Unit 4

* Word Embeddings
* Pretrained Transformers
* Question-Answering Systems
* Speech-to-Text (hopefully)

You will of course need to perform some form of cleaning/text normalization and feature engineering (bag of words and/or word embeddings), but the way you go about that will be problem dependent -- on top of those two steps, you will need to incorporate at least two other model types as well that form some coherent end-stage MVP.

For example:

1) a corpus of news articles pulled from the Bing News API that is cleaned/normalized

2) uses word embeddings to feature engineer the text

3) performs sentiment analysis to score sentiment of all articles

4) articles are sortable by sentiment, and ranked based on their relevance to keywords/search queries (information retrieval)

The MVP is a NewsFeed showing a table of articles displayed in an interactive dashboard

As you are performing your analyses consider:

* What cleaning and normalization steps are necessary for my text, and which are not?
* What sort of feature engineering do I need to utilize, both in terms of using BoW or word embeddings, and in terms of document or word vectorization? Do I need to use different methods for different analysis types?
* What is the purpose of performing your selected methods and how do they meaningfully build on one another?
* What are the practical applications of the models you developed?

### **What methods have you chosen and how do they fit together?**



- **Method 1** – NLP Pipeline (Preprocessing & Normalization)
 - Preprocessing steps include removing stopwords and lemmatization. The lyrics will be tokenized by line.
- **Method 2** – Context Embeddings (SBERT)
 - Use sentence embeddings to create feature representations of song lyrics, which can be further utilized for the other steps.
- **Method 3** – Document Clustering (KMeans/DBSCAN)
 - Apply a clustering algorithm (probably KMeans) to the song embeddings to group the songs into clusters. Create a labeled dataset to use for my MVP.
- **Method 4** – Text Similarity (Recommendation System)
 - Compute the similarity between the input song's lyrics and all other songs in the dataset. Identify the top N most similar songs to the input song based on their lyrics.


**How these methods will be meaningfully combined**:

The user will have the option of choosing a specific theme, entering a prompt, or both to create a unique playlist with links to each of the songs on Spotify. The MVP will be built with Streamlit and will consist of a title, a dropdown of themes, a search bar, and space for the output with links.
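
As a rough sketch of how the theme dropdown and the search bar could feed a single playlist, the function below first filters songs by theme and then ranks what is left by cosine similarity to the query embedding. The function name `build_playlist`, its parameters, and the toy 2-D vectors are illustrative assumptions, not part of the project code; in this notebook the embeddings come from SBERT and the themes from the cluster labels.

```python
import numpy as np

def build_playlist(song_embeddings, themes, query_embedding=None, theme=None, top_n=3):
    """Return row positions of up to top_n songs, filtered by theme and ranked by query.

    Hypothetical helper: song_embeddings is (n_songs, dim), themes is a list of
    n_songs theme names, query_embedding is (dim,) or None.
    """
    candidates = np.arange(len(themes))
    if theme is not None:
        # Keep only songs whose theme label matches the dropdown choice
        candidates = np.array([i for i in candidates if themes[i] == theme], dtype=int)
    if query_embedding is None or len(candidates) == 0:
        # No prompt given: return the first songs of the theme unranked
        return candidates[:top_n].tolist()
    vectors = song_embeddings[candidates]
    # Cosine similarity = dot product divided by the product of the norms
    sims = vectors @ query_embedding / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_embedding)
    )
    order = np.argsort(sims)[::-1][:top_n]
    return candidates[order].tolist()

# Toy data: four songs with 2-D "embeddings" and example theme names
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
themes = ["Festive Cheer", "Urban Vibes", "Festive Cheer", "Urban Vibes"]
query = np.array([1.0, 0.0])

print(build_playlist(embeddings, themes, query))                       # prompt only
print(build_playlist(embeddings, themes, query, theme="Urban Vibes"))  # theme and prompt
print(build_playlist(embeddings, themes, theme="Festive Cheer"))       # theme only
```

Returning row positions rather than titles keeps the function independent of how the song table is stored; the caller can look up titles and links afterwards.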

## **Imports** (Dependencies and Files)

In [None]:
!pip install langdetect
!pip install -U sentence-transformers
from langdetect import detect
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import re
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


In [2]:
df_all_songs = pd.read_csv('C:/Users/04drm/Downloads/data_5420/data_5420/Spotify-Million-Song-Dataset.csv')
df_all_songs.head()

Unnamed: 0,artist,song,link,text
0,ABBA,She's My Kind Of Girl,/a/abba/shes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


## **Method 1** – NLP Pipeline (Preprocessing & Normalization)


Some songs in the dataset are not in English; these are removed.

In [5]:
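def is_english(text):
    """Return True if langdetect identifies the lyrics as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() raises when the text has no usable features (e.g. empty lyrics)
        return False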

In [6]:
df_all_songs['IsEnglish'] = df_all_songs['text'].apply(is_english)
df = df_all_songs[df_all_songs['IsEnglish']]
df = df.drop(columns=['IsEnglish'])
df.reset_index(drop=True, inplace=True)

Preprocess the text by removing stopwords, lemmatizing, normalizing, etc.

In [14]:
# Build these once; recreating them for every line of every song is slow
STOP_WORDS = set(stopwords.words('english')) | {'chorus'}
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    lines_output = []
    for line in text.split('\n'):
        line = line.lower()
        # Stopword removal runs before punctuation stripping, so a token with
        # attached punctuation (e.g. "me,") is not matched and survives as "me"
        line = ' '.join(word for word in line.split() if word not in STOP_WORDS)
        line = line.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
        line = re.sub(r'\d+', '', line)  # Remove numerical data
        line = re.sub(r'[^a-zA-Z\s]', '', line)  # Remove special characters/symbols
        # Lemmatize text; split() with no argument also drops empty tokens
        line = ' '.join(LEMMATIZER.lemmatize(word) for word in line.split())
        if line:
            lines_output.append(line)

    return ' '.join(lines_output)

df['preprocessed_text'] = df['text'].apply(preprocess)
print(df['preprocessed_text'].head())

0    look face wonderful face mean something specia...
1    take easy me please touch gently like summer e...
2    ill never know go put lousy rotten show boy to...
3    making somebody happy question give take learn...
4    making somebody happy question give take learn...
Name: preprocessed_text, dtype: object
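
Row 1 of the output above still contains `me`, which is normally a stopword. A likely cause is the order of operations: stopwords are filtered while punctuation is still attached, so `me,` does not match `me`. The sketch below reproduces this with a tiny hand-written stopword set (an assumption standing in for NLTK's list) and compares it with stripping punctuation first. Both function names are made up for the comparison.

```python
import string

# Tiny stand-in for NLTK's English stopword list (assumption for illustration)
STOP_WORDS = {"it", "with", "me"}
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def stopwords_then_punctuation(line):
    """Same order as preprocess(): filter stopwords, then strip punctuation."""
    kept = " ".join(w for w in line.lower().split() if w not in STOP_WORDS)
    return kept.translate(STRIP_PUNCT)

def punctuation_then_stopwords(line):
    """Alternative order: strip punctuation first, then filter stopwords."""
    cleaned = line.lower().translate(STRIP_PUNCT)
    return " ".join(w for w in cleaned.split() if w not in STOP_WORDS)

line = "Take it easy with me, please"
print(stopwords_then_punctuation(line))  # take easy me please
print(punctuation_then_stopwords(line))  # take easy please
```

Stripping punctuation first is not free of trade-offs: contractions such as "don't" become "dont", which may no longer match the stopword list, so the better order depends on which leftovers matter more for the embeddings.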


## **Method 2** – Context Embeddings (SBERT)

Because the dataset is large and contextual embeddings are effective at capturing meaning, SBERT (the `all-MiniLM-L6-v2` model) will be used to embed each song.

In [15]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [16]:
def get_embeddings(text):
    return model.encode(text)

# Generating sentence embedding from the text
df['song_embeddings'] = df['preprocessed_text'].apply(get_embeddings)

## **Method 3** – Clustering by Theme (KMeans, DBSCAN)

Two clustering methods that are practical on a large dataset are KMeans and DBSCAN. Both will be used in order to evaluate which performs best.

### Run KMeans

In [18]:
num_clusters = 20
kmeans_model = KMeans(n_clusters=num_clusters, random_state=42)
kmeans_model.fit(df['song_embeddings'].to_list())
df['kmeans_label'] = kmeans_model.labels_

### Run DBSCAN

In [109]:
eps_dbscan = .8  # maximum distance between two samples for one to be considered as in the neighborhood of the other
min_samples_dbscan = 4  # number of samples in a neighborhood for a point to be considered as a core point
dbscan_model = DBSCAN(eps=eps_dbscan, min_samples=min_samples_dbscan)
dbscan_model.fit(df['song_embeddings'].tolist())
df['dbscan_label'] = dbscan_model.labels_
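
To get a feel for what `eps = .8` means in terms of similarity: for unit-length vectors, squared Euclidean distance and cosine similarity are linked by `d**2 = 2 - 2 * cos_sim`. Assuming the SBERT embeddings are unit-normalized (worth confirming with `np.linalg.norm` on a few rows), a Euclidean radius of 0.8 corresponds to a cosine similarity of about 0.68. The helper name `eps_to_cosine` is made up for this sketch.

```python
import numpy as np

def eps_to_cosine(eps):
    """Cosine similarity that matches Euclidean distance eps between unit vectors."""
    return 1.0 - eps ** 2 / 2.0

# Check the identity on two random unit vectors
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
distance = np.linalg.norm(a - b)
assert np.isclose(eps_to_cosine(distance), a @ b)

print(round(eps_to_cosine(0.8), 2))  # 0.68
```

Read this way, DBSCAN with these settings treats two songs as neighbors when their lyrics are at least roughly 0.68 similar, which helps explain why it can chain a large share of the songs into one cluster.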

### Output the top songs and words for each cluster

In [92]:
def extract_top_words(df, cluster_column, cluster_label, num_words):
    cluster_group = df[df[cluster_column] == cluster_label]
    cluster_text = ' '.join(cluster_group['preprocessed_text'])
    # Tokenize the text and count the occurrences of each word
    words = cluster_text.split()
    word_counts = pd.Series(words).value_counts()
    # Extract top words
    top_words = word_counts.index[:num_words].tolist()
    return top_words


def print_top_songs(df, cluster_column, clustering_model, num_songs=5, num_words=10):
    # Get the cluster labels to report on
    if isinstance(clustering_model, KMeans):
        cluster_labels = range(clustering_model.n_clusters)
    elif isinstance(clustering_model, DBSCAN):
        # DBSCAN marks noise points with -1; exclude it so only real clusters are listed
        cluster_labels = sorted(set(clustering_model.labels_) - {-1})
    else:
        raise ValueError("Unsupported clustering model.")

    # Iterate over each cluster label
    for cluster_label in cluster_labels:
        # Count the number of songs in the cluster
        cluster_group = df[df[cluster_column] == cluster_label].copy()  # Explicitly create a copy
        cluster_count = len(cluster_group)
        
        # If the cluster has zero songs, print a message and continue to the next cluster
        if cluster_count == 0:
            print(f'Cluster "{cluster_label}": No songs in this cluster.')
            continue
        
        print(f'Cluster "{cluster_label}": ({cluster_count} songs)')
        
        # Extract key features of the cluster
        key_features = extract_top_words(cluster_group, cluster_column, cluster_label, num_words=num_words)
        print("Top words:", key_features)
        
        # Select songs and embeddings in the cluster
        cluster_embeddings = np.stack(cluster_group['song_embeddings'].to_numpy())
        
        if isinstance(clustering_model, KMeans):
            # Calculate centroid of the cluster
            centroid = clustering_model.cluster_centers_[cluster_label]
            # Calculate distance of each song's embedding from the centroid
            distances = [euclidean(embedding, centroid) for embedding in cluster_embeddings]
            
            # Add distances as a new column in the cluster group DataFrame
            cluster_group['distance_to_centroid'] = distances
            
            # Sort songs within the cluster based on distance to centroid
            sorted_cluster_group = cluster_group.sort_values(by='distance_to_centroid')
            
            # Print top songs closest to centroid
            for idx, row in sorted_cluster_group.head(num_songs).iterrows():
                print(f"- Song: {row['song']} by {row['artist']}")
            print()
        elif isinstance(clustering_model, DBSCAN):
            # Print the first songs
            for idx, row in cluster_group.head(num_songs).iterrows():
                print(f"- Song: {row['song']} by {row['artist']}")
            print()

In [90]:
print_top_songs(df, 'kmeans_label', kmeans_model, num_words = 15)

Cluster "0": (2886 songs)
Top words: ['life', 'know', 'im', 'love', 'time', 'one', 'never', 'see', 'day', 'way', 'world', 'cant', 'go', 'like', 'heart']
- Song: More Than You Know by Out Of Eden
- Song: Best Days Of My Life by Hanson
- Song: Change For A Better by Journey
- Song: Shattered By The Sun by Unearth
- Song: Walk Away by Westlife

Cluster "1": (2412 songs)
Top words: ['im', 'like', 'get', 'got', 'nigga', 'know', 'aint', 'yeah', 'go', 'cause', 'ya', 'see', 'shit', 'back', 'want']
- Song: I Got One For Ya by Kid Rock
- Song: Wait Up by Q-Tip
- Song: We Don't Give A by Fabolous
- Song: Be Easy by Wiz Khalifa
- Song: Just Like Me by Usher

Cluster "2": (2836 songs)
Top words: ['know', 'see', 'im', 'time', 'like', 'eye', 'cant', 'go', 'one', 'way', 'away', 'get', 'want', 'say', 'look']
- Song: Another Day by Dream Theater
- Song: Move With Me by Lenny Kravitz
- Song: Something Better by Natalie Imbruglia
- Song: The Ones Who Help To Set The Sun by Dream Theater
- Song: Directions
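
The top words above are dominated by words that are frequent in every cluster (`im`, `know`, `like`), so they say little about what sets one cluster apart from another. One possible alternative, swapped in here for the raw counts, is to rank words by how much more often they appear in the cluster than in the whole corpus. The function `distinctive_words`, its `min_count` threshold, and the toy lyrics are illustrative assumptions.

```python
from collections import Counter

def distinctive_words(cluster_texts, all_texts, num_words=3, min_count=2):
    """Rank words by (share of cluster tokens) / (share of corpus tokens).

    all_texts must include cluster_texts so every cluster word has a corpus count.
    """
    cluster_counts = Counter(" ".join(cluster_texts).split())
    corpus_counts = Counter(" ".join(all_texts).split())
    cluster_total = sum(cluster_counts.values())
    corpus_total = sum(corpus_counts.values())
    scores = {
        word: (count / cluster_total) / (corpus_counts[word] / corpus_total)
        for word, count in cluster_counts.items()
        if count >= min_count  # ignore words too rare to be reliable
    }
    # Highest ratio first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda w: (-scores[w], w))[:num_words]

# Toy corpus of already-preprocessed lyrics (made up for illustration)
festive = ["know love snow know", "love know bell snow bell"]
other = ["know love time know", "love know heart time"]

# Raw counts favor words that are common everywhere
print([word for word, _ in Counter(" ".join(festive).split()).most_common(2)])
# The frequency ratio favors words specific to the cluster
print(distinctive_words(festive, festive + other, num_words=2))
```

In the toy data the raw counts pick `know` and `love`, while the ratio picks `bell` and `snow`, the words that only occur in the festive lyrics.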

In [110]:
print_top_songs(df, 'dbscan_label', dbscan_model, num_words = 15)

Cluster "0": (40162 songs)
Top words: ['im', 'love', 'know', 'like', 'oh', 'got', 'time', 'go', 'get', 'baby', 'one', 'come', 'want', 'let', 'see']
- Song: She's My Kind Of Girl by ABBA
- Song: As Good As New by ABBA
- Song: Cassandra by ABBA
- Song: Chiquitita by ABBA
- Song: Crazy World by ABBA

Cluster "1": (4 songs)
Top words: ['fortune', 'said', 'teller', 'told', 'im', 'love', 'next', 'looking', 'could', 'know', 'happy', 'fool', 'something', 'eye', 'left']
- Song: Fortune Teller by Alison Krauss
- Song: Fortune Teller by Rolling Stones
- Song: Fortune Teller by Hollies
- Song: Fortune Teller by Who

Cluster "2": (8 songs)
Top words: ['auld', 'lang', 'syne', 'o', 'well', 'cup', 'kindness', 'yet', 'acquaintance', 'forgot', 'dear', 'take', 'brought', 'never', 'mind']
- Song: Auld Lang Syne by Barbra Streisand
- Song: Auld Lang Syne by Beach Boys
- Song: Auld Lang Syne by Harry Connick, Jr.
- Song: Auld Lang Syne (The New Year's Anthem) by Mariah Carey
- Song: Auld Lang Syne by Neil D

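After tuning the parameters and evaluating the outcomes, KMeans is the better choice for this dataset and our chosen feature engineering. With the settings above, DBSCAN places 40,162 songs in a single cluster and otherwise finds only small groups of near-identical lyrics (for example, covers of "Fortune Teller" and "Auld Lang Syne"), whereas the KMeans clusters shown each contain a few thousand songs.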
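
One way to put a number on this comparison is to look at how the songs are distributed over the labels. The helper below is a minimal sketch (the name `summarize_labels` and the made-up label lists are assumptions); it reports the number of clusters, the number of noise points, and the share of all points that fall in the largest cluster.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (number of clusters, noise count, share of points in the largest cluster)."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # DBSCAN uses -1 for noise points
    largest = max(counts.values()) if counts else 0
    return len(counts), noise, largest / len(labels)

# Made-up label lists that mimic the two patterns seen in the outputs above
dbscan_like = [0] * 90 + [1] * 4 + [2] * 4 + [-1] * 2
kmeans_like = [0] * 30 + [1] * 25 + [2] * 25 + [3] * 20

print(summarize_labels(dbscan_like))  # (3, 2, 0.9)
print(summarize_labels(kmeans_like))  # (4, 0, 0.3)
```

On the real models this could be called as `summarize_labels(dbscan_model.labels_.tolist())` and `summarize_labels(kmeans_model.labels_.tolist())`. A largest-cluster share close to 1 means the clustering is of little use for building themed playlists.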

### Give the clusters labels

ChatGPT was used to come up with a more descriptive label for each of the clusters.

In [112]:
kmeans_label_dict = {
    0: "Life Reflections",
    1: "Urban Vibes",
    2: "Introspective Moments",
    3: "Dynamic Energy",
    4: "Emotional Journeys",
    5: "Melancholic Moods",
    6: "Romantic Whispers",
    7: "Nightlife Chronicles",
    8: "Skyward Serenity",
    9: "Heartfelt Longings",
    10: "Enduring Affections",
    11: "Casual Affections",
    12: "Everyday Musings",
    13: "Soulful Expressions",
    14: "Eclectic Beats",
    15: "Worldly Observations",
    16: "Spiritual Devotions",
    17: "Dreamy Escapes",
    18: "Festive Cheer",
    19: "Lighthearted Moments"
}

In [113]:
df['kmeans_label_text'] = df['kmeans_label'].map(kmeans_label_dict)

Let's take a look at some random records.

In [115]:
df.sample(n=5)

Unnamed: 0,artist,song,link,text,preprocessed_text,song_embeddings,kmeans_label,dbscan_label,kmeans_label_text
44935,Neil Sedaka,I Let You Walk Away,/n/neil+sedaka/i+let+you+walk+away_20247423.html,I believe that nothing lasts forever \nPromis...,believe nothing last forever promise help u st...,"[-0.057967935, 0.004077129, 0.06776754, -0.012...",0,0,Life Reflections
4353,Donna Summer,I Don't Wanna Work,/d/donna+summer/i+dont+wanna+work_10087984.html,Well I give it little care \nDoes anybody out...,well give little care anybody even know im her...,"[-0.09957706, -0.13900116, 0.05114976, 0.04368...",4,0,Emotional Journeys
50877,Santana,My Man,/s/santana/my+man_20535819.html,Boom Boom Boom \nSantana's in the room \nBoo...,boom boom boom santanas room boom boom boom ma...,"[-0.081187785, -0.13603832, 0.06563344, -0.025...",11,0,Casual Affections
44918,Neil Sedaka,Calendar Girl,/n/neil+sedaka/calendar+girl_20169378.html,"I love, I love, I love my calender girl \nYea...",love love love calender girl yeah sweet calend...,"[-0.07577728, -0.016359465, 0.119390465, 0.016...",18,0,Festive Cheer
12485,Matt Redman,Gloria,/m/matt+redman/gloria_21028003.html,The skies are filled with Your glory \nThe oc...,sky filled glory ocean mirror grace deep high ...,"[-0.03303984, 0.029969208, 0.046279937, -0.089...",16,0,Spiritual Devotions


## **Method 4** – Text Similarity (Recommendation System)

In [149]:
user_query = input("Give some input for your playlist: ")
query_embedding = get_embeddings(user_query)
# make a 2d array, changing the shape from (384,) to (1,384)
query_embedding_2d = query_embedding.reshape(1,-1)
# stack the vectors from every song into a 2d array, changing the shape from (57187,) to (57187,384)
song_embeddings_2d = np.stack(df['song_embeddings'].values)

similarity_scores = cosine_similarity(query_embedding_2d, song_embeddings_2d)

# Row positions of all songs, sorted from most to least similar
top_recommendations = np.argsort(similarity_scores.flatten())[::-1]

# Print top recommendations
top_n = 25
print("Top recommended songs:")
for song_index in top_recommendations[:top_n]:
    # argsort returns row positions, so look rows up with iloc rather than loc
    song_title = df['song'].iloc[song_index]
    artist = df['artist'].iloc[song_index]
    print("Song:", song_title, "by", artist)

Enter your query: something isn't right
Top recommended songs:
Song: You Were Wrong by Freddie King
Song: Right Or Wrong by Patsy Cline
Song: Definition Of Wrong by Travis
Song: Right Or Wrong by Wanda Jackson
Song: What If by Coldplay
Song: High Sierra by Linda Ronstadt
Song: High Sierra by Dolly Parton
Song: So Wrong by Patsy Cline
Song: Mother Stands For Comfort by Kate Bush
Song: It Ain't Right by Eric Clapton
Song: I Need Something Stronger by Unkle
Song: When Something Is Wrong With My Baby by Otis Redding
Song: Not Right by Iggy Pop
Song: Perfect World by Talking Heads
Song: Am I Very Wrong? by Genesis
Song: Which One Of Them by Garth Brooks
Song: Stone Cold Heart by Incognito
Song: Two Rights by Allman Brothers Band
Song: It's Alright With Me by Harry Connick, Jr.
Song: Wrong All Along by Cheap Trick
Song: Is It Wrong (For Loving You) by Loretta Lynn
Song: Cowboy Song by Arlo Guthrie
Song: Check It by Gary Numan
Song: Often by Robbie Williams
Song: What If I by Bosson


## Save the Data

The text data can be saved in a .csv file with the theme labels and none of the extra columns. The song embedding vectors can be saved as a .npy file to preserve all of the information necessary for the recommendation system.

In [152]:
np.save('Song_Embeddings.npy',song_embeddings_2d)

In [153]:
df_export = df.drop(columns=['text','preprocessed_text','song_embeddings','kmeans_label', 'dbscan_label'])
df_export.to_csv('Labeled_Song_Dataset.csv')
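
Because the recommender looks songs up by row position, the CSV and the `.npy` file only work together if their rows stay in the same order. Below is a minimal sketch of a reload-and-check step. It uses temporary toy files rather than the real exports, so the file names, the three songs, and the 4-dimensional vectors are all assumptions.

```python
import csv
import os
import tempfile

import numpy as np

# Toy stand-ins for the exported table and embedding matrix
songs = [("ABBA", "Bang"), ("Journey", "Change For A Better"), ("Westlife", "Walk Away")]
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "songs.csv")
    npy_path = os.path.join(tmp, "embeddings.npy")

    # Write both files in the same row order
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["artist", "song"])
        writer.writerows(songs)
    np.save(npy_path, vectors)

    # Reload both files the way an app would at startup
    with open(csv_path, newline="") as f:
        loaded_songs = list(csv.DictReader(f))
    loaded_vectors = np.load(npy_path)

# Row i of the matrix must describe row i of the table
assert len(loaded_songs) == loaded_vectors.shape[0]
print(loaded_songs[1]["song"], loaded_vectors[1])
```

A row-count check like this is cheap to run at startup and catches the case where one file was regenerated without the other.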