## Requirements
* Load data into dataframe
* Perform appropriate text cleaning & text preprocessing
* Train nearest neighbors model
* Generate song recommendations for:
  * Frosty The Snowman
  * Graceland
  * Love Is A Rose

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 02/24/24 22:43:24


# Clustering - Nearest Neighbor

Data file:
* https://raw.githubusercontent.com/vjavaly/Baruch-CIS-4120/main/data/songdata_40k.csv

### Import libraries

In [2]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [3]:
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load data

In [4]:
# Load data file into dataframe
df = pd.read_csv('https://raw.githubusercontent.com/vjavaly/Baruch-CIS-4120/main/data/songdata_40k.csv')
df.shape

(40000, 4)

### Examine data

In [5]:
df.head()

,artist,song,link,text
0,Wishbone Ash,Right Or Wrong,/w/wishbone+ash/right+or+wrong_20147150.html,Like to have you 'round \nWith all the lies t...
1,Aerosmith,This Little Light Of Mine,/a/aerosmith/this+little+light+of+mine_2064448...,"This Little Light of Mine (Light of Mine), \n..."
2,Fall Out Boy,"Dance, Dance",/f/fall+out+boy/dance+dance_10113666.html,She says she's no good with words but I'm wors...
3,Janis Joplin,Easy Rider,/j/janis+joplin/easy+rider_10147381.html,"Hey mama, mama, come a look at sister, \nShe'..."
4,Moody Blues,Peak Hour,/m/moody+blues/peak+hour_20291295.html,I see it all through my window it seems. \nNe...


### Prepare data

In [6]:
# Drop unnecessary column 'link'
df.drop(['link'], axis=1, inplace=True)
df.head()

,artist,song,text
0,Wishbone Ash,Right Or Wrong,Like to have you 'round \nWith all the lies t...
1,Aerosmith,This Little Light Of Mine,"This Little Light of Mine (Light of Mine), \n..."
2,Fall Out Boy,"Dance, Dance",She says she's no good with words but I'm wors...
3,Janis Joplin,Easy Rider,"Hey mama, mama, come a look at sister, \nShe'..."
4,Moody Blues,Peak Hour,I see it all through my window it seems. \nNe...


In [7]:
# Set the index to be the song name
df.set_index('song', inplace=True)
df.head()

song,artist,text
Right Or Wrong,Wishbone Ash,Like to have you 'round \nWith all the lies t...
This Little Light Of Mine,Aerosmith,"This Little Light of Mine (Light of Mine), \n..."
"Dance, Dance",Fall Out Boy,She says she's no good with words but I'm wors...
Easy Rider,Janis Joplin,"Hey mama, mama, come a look at sister, \nShe'..."
Peak Hour,Moody Blues,I see it all through my window it seems. \nNe...


In [8]:
# Remove newline characters '\r' & '\n'
df['text'] = df['text'].str.replace('\r', ' ')
df['text'] = df['text'].str.replace('\n', ' ')
# Remove extra spaces from column
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()
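
The whitespace cleanup above can be checked on a single string. The following is a minimal sketch using only the standard `re` module; the `raw` sample is made up for illustration. It suggests that the `\s+` pattern alone already collapses `\r` and `\n` along with repeated spaces, so the two literal replacements are harmless but probably redundant.

```python
import re

# Made-up sample with a trailing double space, a newline, and a Windows line ending
raw = "Like to have you 'round  \nWith all the lies\r\n"

# \s matches spaces, tabs, \r and \n, so one substitution handles all of them
normalized = re.sub(r"\s+", " ", raw).strip()

print(normalized)
```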

In [9]:
pd.set_option('max_colwidth', None)

In [10]:
df['text'].head(2)

song
Right Or Wrong               Like to have you 'round With all the lies that you make The things or darkness and you Some people say, have just a taste Right or wrong, you might get burned What you gain is what you learn Got one too many women Don't know quite which way to go They're all gettin' so expensive When they walk by themselves Right or wrong, don't regret What you went for is what you get No point in bitter tears When someone else has cut you down 'Cause there's a time for leavin' And there's a time for stickin' around, hey Right or wrong, you've got to live So what you collect is what you give
This Little Light Of Mine                                                                                                                                                                           This Little Light of Mine (Light of Mine), I'm Let it shine (Aleilujah), This Little Light of Mine, I'm gonna let it shine, Down in my heart (In my heart), I'm gonna let it shine (Aleiluja

In [11]:
# Display first few rows of updated dataframe
df.head()

song,artist,text
Right Or Wrong,Wishbone Ash,"Like to have you 'round With all the lies that you make The things or darkness and you Some people say, have just a taste Right or wrong, you might get burned What you gain is what you learn Got one too many women Don't know quite which way to go They're all gettin' so expensive When they walk by themselves Right or wrong, don't regret What you went for is what you get No point in bitter tears When someone else has cut you down 'Cause there's a time for leavin' And there's a time for stickin' around, hey Right or wrong, you've got to live So what you collect is what you give"
This Little Light Of Mine,Aerosmith,"This Little Light of Mine (Light of Mine), I'm Let it shine (Aleilujah), This Little Light of Mine, I'm gonna let it shine, Down in my heart (In my heart), I'm gonna let it shine (Aleilujah) Down in my heart (In My heart) I'm gonna let it, let it shine. All over the world (All over the world), I'm gonna let it shine (Let it shine, let it shine let it shine) Let it shine, let it shine, let it shine, let it shine"
"Dance, Dance",Fall Out Boy,"She says she's no good with words but I'm worse Barely stuttered out A joke of a romantic stuck to my tongue And weighed down with words too overdramatic Tonight it's ""it can't get much worse"" Vs. ""no one should ever feel like.."" I'm two quarters and a heart down And I don't want to forget how your voice sounds These words are all I have so I write them I need them just to get by Dance, dance We're falling apart to half time Dance, dance And these are the lives you love to lead Dance this is the way they'd look If they knew how misery loved me You always fold just before you're found out Drink up its last call Last resort But only the first mistake and I I'm two quarters and a heart down And I don't want to forget how your voice sounds These words are all I have so I write them I need them just to get by Why don't you show me a little bit of spine You've been saving for his mattress (love) Dance, dance We're falling apart to half time Dance, dance And these are the lives you love to lead Dance this is the way they'd look If they knew how misery loved me Why don't you show me a little bit of spine You've been saving for his mattress (with love) I only want sympathy in the form of you crawling into bed with me Dance, dance, we're falling apart to half time Dance, dance, and these are the lives you love to lead Dance this is the way they'd look If they knew how misery loved me"
Easy Rider,Janis Joplin,"Hey mama, mama, come a look at sister, She's a-standing on the levee trying to do that twist, But easy rider don't you deny my name, Oh no, oh no. Well, I got a girl with a diamond ring, I'll tell you, boys, she knows how to shake that thing. Oh! Easy rider don't you deny my name, Oh no, oh no. Play it! Well, I got a horse and he lives in a tree, He watches Huckleberry Hound on his tv. But easy rider don't you deny my name, Oh no, oh no. I would buy you a plastic suit And I would even buy you some cardboard fruit. Oh! But easy rider don't you deny my name, Oh no, oh no. Yeah, easy rider, don't you deny my name, pretty baby doll I said, easy rider, don't you deny my name, pretty baby doll I said, easy rider, don't you deny my name, pretty baby doll I said, easy rider, don't you deny my name, pretty baby..."
Peak Hour,Moody Blues,"I see it all through my window it seems. Never failing, like millions of eels. All that is wrong, No time to be won. Only to do What can be done. Peak hour, Peak hour, Peak hour. Minds are subject to what should be done. Problem solved, time cannot be won. One hour a day, One hour a night Sees crowds of people Home-aimed for flight. Peak hour, Peak hour, Peak hour. It makes me want to run out and tell them They've got time. Take a step back out and warn them I've found out I've got time. Minds are subject to what should be done. Problem solved, time cannot be won. One hour a day, One hour a night Sees crowds of people Home-aimed for flight. Peak hour, Peak hour, Peak hour."


### Create function to preprocess text

In [12]:
lem = WordNetLemmatizer()

In [13]:
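def clean_text(text):
  """Lowercase, strip punctuation, drop stop words, and lemmatize a lyric string."""
  punct = string.punctuation
  # A set makes each stop-word membership test O(1) instead of scanning a list
  stop = set(stopwords.words('english'))
  # Iterating over a string yields characters, so this drops punctuation characters
  text = "".join(ch.lower() for ch in text if ch not in punct)
  # Raw string avoids the invalid '\W' escape-sequence warning
  tokens = re.split(r'\W+', text)
  # Lemmatize the tokens that survive stop-word filtering
  return ' '.join(lem.lemmatize(word) for word in tokens if word not in stop)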
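
One side effect of the order of operations in `clean_text` is worth seeing in isolation: punctuation is removed before stop words are filtered, so a contraction such as "don't" becomes "dont" and no longer matches its stop-word entry. The sketch below illustrates this with the standard library only. `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK list, and `filter_then_strip` is an assumed alternative ordering, not part of the notebook.

```python
import re
import string

# Hypothetical, tiny stop-word list standing in for stopwords.words('english')
STOP_WORDS = {"i", "i'm", "don't", "you", "to", "it", "the", "a"}

def strip_then_filter(text):
    """Same order as clean_text: remove punctuation first, then drop stop words."""
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    return [tok for tok in re.split(r"\W+", no_punct) if tok and tok not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stop words while apostrophes are still intact."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    # Remove apostrophes afterwards and discard anything left empty
    return [tok.replace("'", "") for tok in kept if tok.replace("'", "")]

sample = "I'm gonna let it shine, don't you know"
print(strip_then_filter(sample))   # keeps 'im' and 'dont'
print(filter_then_strip(sample))   # drops the contractions
```

This is consistent with tokens such as `dont` and `im` appearing in the `text_clean` column. Whether they should be dropped is a modeling choice; changing the order would also change the recommendations.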

In [14]:
%%time
# Apply function to clean column 'text'
# BE PATIENT: This cell could take a minute to execute with the larger data file

df['text_clean'] = df['text'].apply(clean_text)

CPU times: user 57.3 s, sys: 1.09 s, total: 58.4 s
Wall time: 1min 2s


In [15]:
# Display first few rows of updated dataframe
df[['text','text_clean']].head(2)

song,text,text_clean
Right Or Wrong,"Like to have you 'round With all the lies that you make The things or darkness and you Some people say, have just a taste Right or wrong, you might get burned What you gain is what you learn Got one too many women Don't know quite which way to go They're all gettin' so expensive When they walk by themselves Right or wrong, don't regret What you went for is what you get No point in bitter tears When someone else has cut you down 'Cause there's a time for leavin' And there's a time for stickin' around, hey Right or wrong, you've got to live So what you collect is what you give",like round lie make thing darkness people say taste right wrong might get burned gain learn got one many woman dont know quite way go theyre gettin expensive walk right wrong dont regret went get point bitter tear someone else cut cause there time leavin there time stickin around hey right wrong youve got live collect give
This Little Light Of Mine,"This Little Light of Mine (Light of Mine), I'm Let it shine (Aleilujah), This Little Light of Mine, I'm gonna let it shine, Down in my heart (In my heart), I'm gonna let it shine (Aleilujah) Down in my heart (In My heart) I'm gonna let it, let it shine. All over the world (All over the world), I'm gonna let it shine (Let it shine, let it shine let it shine) Let it shine, let it shine, let it shine, let it shine",little light mine light mine im let shine aleilujah little light mine im gonna let shine heart heart im gonna let shine aleilujah heart heart im gonna let let shine world world im gonna let shine let shine let shine let shine let shine let shine let shine let shine


### Vectorize cleaned 'text' column using TF-IDF

In [16]:
%%time
# BE PATIENT: This cell could take a minute to execute with the larger data file

tfidf = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = tfidf.fit_transform(df['text_clean'])

CPU times: user 34 s, sys: 1.37 s, total: 35.4 s
Wall time: 35.6 s


In [17]:
tfidf_matrix.shape

(40000, 4317974)

In [18]:
tfidf_matrix

<40000x4317974 sparse matrix of type '<class 'numpy.float64'>'
	with 9194568 stored elements in Compressed Sparse Row format>
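
The matrix has more than 4.3 million columns largely because `ngram_range=(1, 3)` adds every two-word and three-word sequence to the vocabulary alongside single words. The sketch below enumerates n-grams in plain Python on two made-up token strings to show how quickly the vocabulary grows. The `ngrams` helper is hypothetical; it illustrates the idea and is not how `TfidfVectorizer` is implemented.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram (as a string) for n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Two made-up cleaned "lyrics" for illustration
docs = [
    "let shine let shine",
    "gonna let shine heart",
]

vocab_unigram = {g for d in docs for g in ngrams(d.split(), 1, 1)}
vocab_up_to_3 = {g for d in docs for g in ngrams(d.split(), 1, 3)}

# Vocabulary size with single words only vs. with 1- to 3-word sequences
print(len(vocab_unigram), len(vocab_up_to_3))
```

On this toy input the vocabulary triples. Restricting `ngram_range` is one option to consider if memory or run time becomes a concern.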

In [19]:
song_indices = pd.Series(df.index)
song_indices

0                   Right Or Wrong
1        This Little Light Of Mine
2                     Dance, Dance
3                       Easy Rider
4                        Peak Hour
                   ...            
39995                  Act Of Love
39996                     Luv Lies
39997      Something To Believe In
39998           See Through Dreams
39999             Manic Depression
Name: song, Length: 40000, dtype: object

### Train NearestNeighbors model

In [20]:
nn_model = NearestNeighbors(n_neighbors=6, algorithm='brute')

In [21]:
%%time

nn_model.fit(tfidf_matrix)

CPU times: user 32.7 ms, sys: 18.9 ms, total: 51.6 ms
Wall time: 58.7 ms
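
The model above is created without a `metric` argument, so it uses scikit-learn's default Euclidean distance (Minkowski with `p=2`). Because `TfidfVectorizer` L2-normalizes each row by default, Euclidean distance between rows is a monotonic function of cosine similarity, and the neighbor ranking should match what `metric='cosine'` would give. The sketch below checks the underlying identity, squared distance = 2 - 2 × cosine, on two made-up vectors using only the standard library.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for two TF-IDF rows
a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([2.0, 1.0, 1.0])

dist = euclidean(a, b)
cos = cosine_similarity(a, b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(math.isclose(dist ** 2, 2 - 2 * cos))
```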


### Create nearest neighbors song recommender function

In [22]:
def recs_nn(seed_song_name, df=df, nn_model=nn_model):

  # Get row position of the seed song (the first match when several songs share the title)
  seed_song_index = song_indices[song_indices == seed_song_name].index[0]

  # Get nearest neighbors distances and indexes for seed song
  distances, indices = nn_model.kneighbors(tfidf_matrix[seed_song_index])

  # Print the seed song name and the 5 closest matching songs.
  # Rank 1 is skipped on the assumption that it is the seed song itself,
  # so the printed recommendation numbers run from 2 to 6.
  print(f"Seed song [{seed_song_index}] : {seed_song_name} [{df.iloc[seed_song_index]['artist']}]")
  for count, i in enumerate(indices[0], start=1):
    if count == 1:
      continue
    song = df.index[i]
    print(f"Recommendation {count} [{i}] : {song} [{df.iloc[i]['artist']}]")
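
The lookup in `recs_nn` takes the first row whose title matches and raises an `IndexError` when there is no match. Since song titles are not unique in this dataset, a lookup that can also filter by artist may be useful. The following is a minimal sketch under that assumption; `find_song_positions` is a hypothetical helper, and the toy lists do not reflect real row positions.

```python
def find_song_positions(titles, artists, seed_title, seed_artist=None):
    """Return every row position whose title (and, if given, artist) matches."""
    matches = [
        pos
        for pos, (title, artist) in enumerate(zip(titles, artists))
        if title == seed_title and (seed_artist is None or artist == seed_artist)
    ]
    if not matches:
        raise KeyError(f"No match for title {seed_title!r}, artist {seed_artist!r}")
    return matches

# Toy data for illustration only; positions do not match the real dataframe
titles = ["Frosty The Snowman", "Graceland", "Frosty The Snowman"]
artists = ["Nat King Cole", "Paul Simon", "Bing Crosby"]

print(find_song_positions(titles, artists, "Frosty The Snowman"))
print(find_song_positions(titles, artists, "Frosty The Snowman", "Bing Crosby"))
```

With the real data, `titles` and `artists` could be taken from `df.index` and `df['artist']`.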

### Generate song recommendations for:
* Frosty The Snowman
* Graceland
* Love Is A Rose

In [23]:
recs_nn("Frosty The Snowman", df, nn_model)

Seed song [167] : Frosty The Snowman [Nat King Cole]
Recommendation 2 [1444] : Frosty The Snowman [Christmas Songs]
Recommendation 3 [18182] : Frosty The Snowman [Gloria Gaynor]
Recommendation 4 [19593] : Frosty The Snowman [Harry Connick, Jr.]
Recommendation 5 [35550] : Frosty The Snowman [Bing Crosby]
Recommendation 6 [24467] : Frosty The Snowman [Raffi]
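
All five recommendations above are other recordings of the seed song, which is expected when near-identical lyrics appear under several artists. If different songs are wanted, one option is to request more neighbors (for example through the `n_neighbors` argument of `kneighbors`) and then drop results that share the seed's title. The sketch below shows only the filtering step; `drop_same_title` is a hypothetical helper and the neighbor list is made up, not actual model output.

```python
def drop_same_title(seed_title, neighbors, k=5):
    """Keep the first k neighbors whose title differs from the seed title.

    neighbors: list of (title, artist) tuples ordered nearest-first.
    """
    seed_key = seed_title.casefold()
    kept = [(t, a) for t, a in neighbors if t.casefold() != seed_key]
    return kept[:k]

# Made-up neighbor list for illustration; not actual model output
neighbors = [
    ("Frosty The Snowman", "Bing Crosby"),
    ("Winter Song A", "Artist A"),
    ("Frosty the Snowman", "Raffi"),
    ("Winter Song B", "Artist B"),
]

print(drop_same_title("Frosty The Snowman", neighbors, k=2))
```

The comparison is case-insensitive so that minor capitalization differences between releases are treated as the same title.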


In [24]:
recs_nn("Graceland", df, nn_model)

Seed song [9400] : Graceland [Paul Simon]
Recommendation 2 [15704] : Jesus Mentioned [Warren Zevon]
Recommendation 3 [23897] : Wayfaring Stranger [Ed Sheeran]
Recommendation 4 [12405] : Goin' Home [Conway Twitty]
Recommendation 5 [1463] : Memphis, Tennessee [Roy Orbison]
Recommendation 6 [2694] : Going, Going, Gone [Bob Dylan]


In [25]:
recs_nn("Love Is A Rose", df, nn_model)

Seed song [4868] : Love Is A Rose [Linda Ronstadt]
Recommendation 2 [1067] : And Roses And Roses [Andy Williams]
Recommendation 3 [37851] : Give My Love To Rose [George Jones]
Recommendation 4 [35150] : Give My Love To Rose [Johnny Cash]
Recommendation 5 [9952] : Rose Rose I Love You [Frankie Laine]
Recommendation 6 [31948] : Like A Rose On The Grave Of Love [Xandria]
