### Final Project Part 2 
Group 6: Paul Miller, Aimee Flynn, Barza Fayazi-Azad <br>
Song Recommendation Engine 


## Preprocessing
The preprocessing stage takes our manually cleaned CSV and normalizes the data. Normalization includes lowercasing the text, removing special characters and extra whitespace, tokenizing, and dropping English stop words.

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 200
%matplotlib inline

In [7]:
#Reading in csv file into dataframe
rawdata = pd.read_csv("../data/Spotify-2000manualcleaned.csv")
rawdata.head(2)

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39


In [8]:
#For our Recommendation Engine, the team deemed only three columns were needed
needed_data = rawdata[["Title", "Artist", "Top Genre"]]
needed_data.head(1)

Unnamed: 0,Title,Artist,Top Genre
0,Sunrise,Norah Jones,adult standards


In [9]:
#Simple EDA
print('The shape of the data: ', needed_data.shape)
print()
print('The sum of null values: ', needed_data.isnull().sum())

The shape of the data:  (1994, 3)

The sum of null values:  Title        0
Artist       0
Top Genre    0
dtype: int64


In [10]:
#normalizing data
#Put each column into a list
titles = needed_data['Title'].tolist()
artists = needed_data['Artist'].tolist()
genres = needed_data['Top Genre'].tolist()
#Combine the title, artist, and genre of each song into one string to build a corpus
corpus = [f"{elem1} {elem2} {elem3}" for elem1, elem2, elem3 in zip(titles, artists, genres)]

In [11]:
#Normalizing the data by turning text into lower case, removing special characters and whitespaces, and tokenization.
tokenizer = nltk.WordPunctTokenizer()
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

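def normalized(text):
    # Keep only ASCII letters and whitespace; flags must be passed by keyword,
    # because the fourth positional argument of re.sub is `count`
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I|re.A)
    text = text.lower()
    text = text.strip()
    tokens = tokenizer.tokenize(text)

    # Drop English stop words and rejoin the remaining tokens
    filtered_tokens = [token for token in tokens if token not in stop_words]
    text = ' '.join(filtered_tokens)
    return text

# Vectorize so the function can be applied to the whole corpus at once
normalizedText = np.vectorize(normalized)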

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pdmil\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [12]:
normCorpus = normalizedText(corpus)
normCorpus[0]

'sunrise norah jones adult standards'
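
A detail worth keeping in mind for regex-based cleanup like this: in `re.sub(pattern, repl, string, count=0, flags=0)` the fourth positional parameter is `count`, not `flags`. The sketch below uses a made-up sample string (not project data) to show how the two readings differ.

```python
import re

sample = "AC/DC: Back In Black (1980)!"   # illustrative string, not from the dataset
pattern = r"[^a-z\s]"  # anything that is not a lowercase letter or whitespace

# re.I | re.A is the integer 258, so used as `count` it only caps the
# number of replacements at 258 and leaves the match case-sensitive.
as_count = re.sub(pattern, "", sample, count=int(re.I | re.A))

# Passed by keyword, the flags make the character class case-insensitive,
# so uppercase letters survive and only digits and punctuation are removed.
as_flags = re.sub(pattern, "", sample, flags=re.I | re.A)

print(repr(as_count))   # ' ack n lack '
print(repr(as_flags))   # 'ACDC Back In Black '
```

With a pattern that already lists both cases, such as `[^a-zA-Z\s]`, the difference is invisible on short strings, which is why this kind of slip can go unnoticed.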

## Feature Extraction
The feature extraction phase includes several NLP methods: a BOW model built with CountVectorizer, a TF-IDF model built with TfidfTransformer, a text similarity matrix built with cosine similarity, and an experimental topic extraction model built with LatentDirichletAllocation.

In [14]:
#implementing BOW Model
from sklearn.feature_extraction.text import CountVectorizer
# get bag of words features in sparse format
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(normCorpus)

In [18]:
#Implementing TF-IDF Model
from sklearn.feature_extraction.text import TfidfTransformer

tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)

tt_matrix = tt_matrix.toarray()
vocab = cv.get_feature_names_out()
pd.DataFrame(np.round(tt_matrix, 2), columns=vocab).head()

Unnamed: 0,aan,aanzoek,abba,abel,absolute,absolution,accidentally,acda,acdc,ace,...,zone,zonnestralen,zou,zoutelande,zucchero,zullen,zuuje,zwart,zweet,zz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
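
To make the weighting concrete, here is a minimal pure-Python sketch of what `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)` is documented to compute: each raw term count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then scaled to unit L2 length. The three-document toy corpus and the helper name `tfidf_rows` are illustrative assumptions. The sketch also splits on whitespace only, which is simpler than CountVectorizer's default tokenization.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) of L2-normalized TF-IDF weights for whitespace-tokenized docs."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of docs that contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed idf, following the documented smooth_idf=True formula.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy corpus in the same "title artist genre" shape (hypothetical rows).
toy_docs = [
    "sunrise norah jones adult standards",
    "black night deep purple album rock",
    "black pearl jam alternative rock",
]
vocab, rows = tfidf_rows(toy_docs)

# "black" appears in two documents, so it is weighted below "night",
# which appears in only one.
print(rows[1][vocab.index("black")] < rows[1][vocab.index("night")])
```

Terms shared by many songs, such as common genre words, therefore contribute less to a row than rare title or artist words.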


In [19]:
#Implementing text similarity (Cosine)
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tt_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.135837,0.167904,0.135073,0.151826,0.149079,0.147947,0.0,0.0,0.132355
1,0.0,1.0,0.0,0.0,0.017824,0.013021,0.0,0.019696,0.020087,0.0,...,0.042465,0.0,0.021201,0.0,0.0,0.0,0.0,0.016649,0.0,0.0
2,0.0,0.0,1.0,0.055325,0.0,0.035211,0.0,0.0,0.0,0.247745,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.055325,1.0,0.0,0.042436,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.017824,0.0,0.0,1.0,0.011078,0.0,0.016757,0.017089,0.0,...,0.01223,0.0,0.018037,0.0,0.0,0.0,0.0,0.014164,0.0,0.0
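
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, which is why every song scores 1.0 against itself on the diagonal and 0.0 against songs that share no terms. A minimal sketch with made-up vectors follows; `cosine` is a hypothetical helper, not the scikit-learn function.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # shares no non-zero component with a

print(round(cosine(a, b), 6))  # same direction, so the score is 1 up to rounding
print(cosine(a, c))            # no overlap, so the score is 0.0
```

Because the TF-IDF rows are already L2-normalized, the score between two songs reduces to the dot product of their rows.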


In [34]:
similarity_df.to_csv('similarity_df.csv', index=False)

In [35]:
#Implementing Topic Extraction
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation()
dt_matrix = lda.fit_transform(cv_matrix)
features = pd.DataFrame(dt_matrix)
features.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.016669,0.016667,0.569481,0.016667,0.297183,0.016667,0.016667,0.016667,0.016667,0.016667
1,0.871406,0.014286,0.014288,0.014288,0.014289,0.01429,0.014286,0.014286,0.014294,0.014288
2,0.014286,0.442866,0.014286,0.014286,0.014286,0.014288,0.442845,0.014286,0.014287,0.014286
3,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.85,0.016667,0.016667,0.016667
4,0.012501,0.0125,0.012501,0.0125,0.0125,0.0125,0.0125,0.012501,0.012501,0.887496
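
Each row of this document-topic matrix is a distribution over the 10 topics that `LatentDirichletAllocation()` fits by default, so one simple way to read it is to take the highest-weight topic per song. The sketch below does that for the five rows printed above, copied by hand; `dominant_topic` is a hypothetical helper. Since no `random_state` was set, topic numbering and weights may differ between runs.

```python
# First five rows of the document-topic matrix, copied from the output above.
doc_topic_rows = [
    [0.016669, 0.016667, 0.569481, 0.016667, 0.297183, 0.016667, 0.016667, 0.016667, 0.016667, 0.016667],
    [0.871406, 0.014286, 0.014288, 0.014288, 0.014289, 0.01429, 0.014286, 0.014286, 0.014294, 0.014288],
    [0.014286, 0.442866, 0.014286, 0.014286, 0.014286, 0.014288, 0.442845, 0.014286, 0.014287, 0.014286],
    [0.016667, 0.016667, 0.016667, 0.016667, 0.016667, 0.016667, 0.85, 0.016667, 0.016667, 0.016667],
    [0.012501, 0.0125, 0.012501, 0.0125, 0.0125, 0.0125, 0.0125, 0.012501, 0.012501, 0.887496],
]

def dominant_topic(row):
    """Return (topic_index, weight) for the largest entry in a topic distribution."""
    topic = max(range(len(row)), key=row.__getitem__)
    return topic, row[topic]

for song_idx, row in enumerate(doc_topic_rows):
    topic, weight = dominant_topic(row)
    print(f"song {song_idx}: topic {topic} (weight {weight:.3f})")
```

Song 2 splits almost evenly between topics 1 and 6, which suggests that a single dominant-topic label can hide useful information for short documents like these.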


## Functionality
The functionality phase contains the core of the project: a recommendation engine function that returns the 10 songs most similar to the song the user inputs. It uses the cosine similarity matrix produced during the feature extraction phase to select those 10 songs. We also test the function on a few songs picked from the dataset.

In [50]:
# Recommendation Engine Function to output the top 10 most similar songs to the user-inputted title

def song_rec_engine(song_title, full_data=rawdata, sim_matrix_df=similarity_df):
    
    # Get the id of the inputted song title
    song_index = np.where(full_data['Title'] == song_title)[0][0]
    
    # Get all the similarity values to the inputted song title
    similar_songs = sim_matrix_df.iloc[song_index].values
    
    # Get the IDs of the top 10 most similar songs to the inputted song title
    # We skip the first one since it will always be the same song that was inputted
    similar_song_ids = np.argsort(-similar_songs)[1:11]

    # print (f'similar_song_ids: {similar_song_ids}')
    
    # Get the names of the top 10 similar songs
    similar_song_names = full_data['Title'][similar_song_ids]
    similar_song_artists = full_data['Artist'][similar_song_ids]
    similar_song_genres = full_data['Top Genre'][similar_song_ids]
    
    # Return the song names, artists, and genres
    return [similar_song_names, similar_song_artists, similar_song_genres]
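
The `[1:11]` slice assumes the queried song always sorts first. That holds when its similarity of 1.0 is unique, but a second row with identical title, artist, and genre text would tie at 1.0 and could sort ahead of it, and a title that is not in the dataset makes `np.where(...)[0][0]` raise an `IndexError`. A minimal sketch of a variant that excludes the query by index and reports unknown titles follows. It uses plain Python lists and made-up toy data, and `top_k_similar` is a hypothetical helper rather than the project's function.

```python
def top_k_similar(title, titles, sim_rows, k=10):
    """Return up to k (title, score) pairs most similar to `title`, excluding the query row."""
    if title not in titles:
        raise KeyError(f"Song title not found: {title!r}")
    query_idx = titles.index(title)   # first match, like np.where(...)[0][0]
    scores = sim_rows[query_idx]
    # Rank every other row by descending similarity; ties keep dataset order
    # because Python's sort is stable.
    candidates = [i for i in range(len(titles)) if i != query_idx]
    candidates.sort(key=lambda i: -scores[i])
    return [(titles[i], scores[i]) for i in candidates[:k]]

# Toy data (hypothetical): "Song C" ties with the query "Song A" at 1.0.
toy_titles = ["Song A", "Song B", "Song C", "Song D"]
toy_sim = [
    [1.0, 0.2, 1.0, 0.5],
    [0.2, 1.0, 0.2, 0.0],
    [1.0, 0.2, 1.0, 0.5],
    [0.5, 0.0, 0.5, 1.0],
]

print(top_k_similar("Song A", toy_titles, toy_sim, k=2))
# [('Song C', 1.0), ('Song D', 0.5)]
```

Excluding by index keeps the tied duplicate in the results instead of silently dropping whichever row happened to sort first.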

In [51]:
# Testing out the functionality with some random songs
# Eventually this will transition to a user-inputted song title
test_song_list = [rawdata['Title'][50], rawdata['Title'][123], rawdata['Title'][33], rawdata['Title'][86]]

for song in test_song_list:
    print("The song you selected was: {}".format(song))
    print("")
    print("The top 10 recommended songs are:")
    # Call the engine once and unpack the titles, artists, and genres it returns
    for s, a, g in zip(*song_rec_engine(song)):
        print("{} by {} (genre: {})".format(s.strip(),a.strip(),g.strip()))
    print("")

The song you selected was: Just Breathe

The top 10 recommended songs are:
Black by Pearl Jam (genre: alternative rock)
Alive by Pearl Jam (genre: alternative rock)
Daughter by Pearl Jam (genre: alternative rock)
Rearviewmirror by Pearl Jam (genre: alternative rock)
Jeremy by Pearl Jam (genre: alternative rock)
Even Flow by Pearl Jam (genre: alternative rock)
Breathe by The Prodigy (genre: big beat)
The Air That I Breathe by The Hollies (genre: adult standards)
Black Betty by Ram Jam (genre: album rock)
About A Girl by Nirvana (genre: alternative rock)

The song you selected was: Eternal Flame

The top 10 recommended songs are:
Let There Be Rock by AC/DC (genre: album rock)
Overture by The Who (genre: album rock)
It's All Over Now by The Rolling Stones (genre: album rock)
My Generation by The Who (genre: album rock)
Crazy On You by Heart (genre: album rock)
Is This Love by Whitesnake (genre: album rock)
Rock and Roll by Led Zeppelin (genre: album rock)
See Me, Feel Me by The Who (genre

## Personal Statements

Aimee Flynn <br>
On Final Project Pt. 2, my first task was manually cleaning the data. Some of the song titles in our data included extra information: who was featured on the song, which version it was, whether and when it had been remastered, and whether it was performed live. I went through the file and removed this information. My second task was reading in and preprocessing the data. I made sure there were no null values and normalized the data. I also worked on functionality such as text similarity and topic extraction. The other two team members worked on the main functionality and on creating the app. <br> <br>
Paul Miller <br>
I performed some rough EDA on an early version of our data. I also participated in our team planning and brainstorming sessions. I have also begun research to build the web app and investigate possible hosting solutions we could use to deploy. <br><br>
Barza Fayazi-Azad <br>
For this assignment, my role was to take the preprocessing and feature extraction output from Aimee and implement the main functionality of our application, a song recommendation engine. I took the cosine similarity matrix produced by the feature extraction and created a function that takes a user-inputted song title and outputs the top 10 most similar songs in our dataset, based on a corpus of song title, song artist, and song genre. I also tested the functionality on a few songs from the dataset to ensure that the recommendation system was working as intended.
 