# Project 2 Example
### Alec K. Mattu (UTA)
### INST 414
### 10/07/2021

# Instructions
(1) add your favorite song to the dataset as the last row. Manually enter the data entries of your favorite song (song name, artist name, year, and lyrics). 

(2) remove the stop-words from the lyrics of all songs in your dataset. You can use any of the packages we discussed in the lecture and use their built-in list of stop-words. 

(3) perform stemming using textblob. Add a new column and in it, for each song lyrics, store the stemmed version of the words. Call this column lyrics_stemmed.

(4) vectorize songs once using count vectorization and once using tfidf vectorization. So now for each song there is a count vector and a tfidf vector. Store these vectors in two separate dataframes. Call them count_vecs and tfidf_vecs. 

(5) find the cosine similarity of your favorite song to every song in the dataset, once using count_vecs and once using tfidf_vecs. 

(6) find the top 5 most similar songs to your favorite song, once using count_vecs and once using tfidf_vecs. 

(7) Inspect the results of this simple recommender system manually. Between count_vecs and tfidf_vecs, which one has recommended songs that you deem somehow similar to your favorite song? Explain your answer with reasoning and observations in no more than 50 words, in a markdown cell at the bottom of the notebook.

# Solution

## Part 0 - Preparation

In [1]:
# Import required libraries
import csv
import seaborn as sns
import pandas as pd
import requests
import textblob
import nltk
import sklearn.feature_extraction.text  # provides ENGLISH_STOP_WORDS
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as bs
from sklearn.preprocessing import minmax_scale

In [2]:
# Hold songs in a list
songs = []

# Open CSV file with handle
with open("dataset_out.csv") as csv_file:
    # Create a file reader
    file_reader = csv.DictReader(csv_file)
    
    # Loop through CSV rows
    for row in file_reader:
        songs.append(row)

## Part 1 - Favorite Song Information
Get the lyrics from azlyrics.com and then insert them into the CSV from Homework 1.

**OPTIONAL: Use the following approach instead of manually adding the song**

In [3]:
# Placeholder dict
song = {
    "Title": "taro",
    "Artist": "alt-j",
    "Year": 2012,
    "Lyrics": "",
}
page = requests.get("https://www.azlyrics.com/lyrics/altj/taro.html")

# Validate the lyric request; stop the cell if the download failed
if page.status_code != 200:
    raise RuntimeError("Failed to download song lyrics")

# Parse the HTML content from the request
soup = bs(page.content, "html.parser")
lyric_div = soup.find(class_= "col-xs-12 col-lg-8 text-center").find("div", attrs = {'class': None})   
lyrics = lyric_div.text.splitlines()
lyrics = list(filter(None, lyrics))

# Recombine the lyrics into a single string
song["Lyrics"] = "\n".join(lyrics)
songs.append(song)

In [4]:
# Transform list into a DF
df = pd.DataFrame(songs)

In [5]:
# Drop the unnamed first column
df.drop(df.columns[0], axis = 1, inplace = True, errors = 'ignore')

# Drop normalized columns
df.drop(["normalized_L", "normalized_V", "normalized_D"], axis = 1, inplace = True, errors = 'ignore')

# Drop Z-Score columns
df.drop(["z_score_L", "z_score_V", "z_score_D"], axis = 1, inplace = True, errors = 'ignore')

# Drop L, V, D
df.drop(["L", "V", "D"], axis = 1, inplace = True, errors = 'ignore')

In [6]:
df

Unnamed: 0,Title,Artist,Year,Lyrics
0,the-battle,blood-sweat-tears,1970,While the king and queen lie sleeping\nAnd the...
1,hey-jude,count-basie,1970,"Hey Jude, don't make it bad\nTake a sad song a..."
2,time,david-bowie,1973,"Time, he's waiting in the wings\nHe speaks of ..."
3,we-can-make-the-world-a-whole-lot-brighter,the-brady-bunch,1972,"Birds flying high,\nIn search of a clear blue ..."
4,day-by-day,carmen-mcrae,1972,Day by day I'm falling more in love with you\n...
...,...,...,...,...
4496,tester,anthrax,2006,"I've changed, by staying the same\nWhat does i..."
4497,killing-me-inside,crossfade,2011,There's a dream that comes to me\nAnd it whisp...
4498,babel,cruel-tie,2015,"I'm stepping down, hurrin' up. Settle down. Do..."
4499,i-know-i-ve-been-changed,aaron-neville,2010,"Oh I, know I've been changed\nAnd I know I've ..."


## Part 2 - Remove Stop-Words

In [7]:
"""
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS is a FrozenSet (immutable)

Eg. 
word in text.ENGLISH_STOP_WORDS
"""

# OPTIONAL: Replace newlines with spaces and strip commas
df["Lyrics"] = df["Lyrics"].apply(lambda x: x.replace("\n", " ").replace(",", ""))

# Helper function: remove stop-words from a lyrics string
def trim_stopwords(x):
    # Split the lyrics into a list
    lyrics = x.split(" ")
    
    # Keep only the words whose lowercase form is not in the stop-word set
    new_lyrics = [w for w in lyrics if w.lower() not in sklearn.feature_extraction.text.ENGLISH_STOP_WORDS]
    
    # Join the lyrics at a space, return a string
    return " ".join(new_lyrics)
    
# Remove stop-words from every song's lyrics
df["Lyrics"] = df["Lyrics"].apply(trim_stopwords)

In [8]:
df.head()

Unnamed: 0,Title,Artist,Year,Lyrics
0,the-battle,blood-sweat-tears,1970,king queen lie sleeping daughters smile nice B...
1,hey-jude,count-basie,1970,Hey Jude don't make bad sad song make better R...
2,time,david-bowie,1973,Time he's waiting wings speaks senseless thing...
3,we-can-make-the-world-a-whole-lot-brighter,the-brady-bunch,1972,Birds flying high search clear blue sky they'r...
4,day-by-day,carmen-mcrae,1972,Day day I'm falling love day day love grow isn...
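
Because the cell above splits on spaces and only strips commas, a token such as `the.` still carries punctuation and will not match the stop-word set. The sketch below shows one possible way to compare on a punctuation-stripped form. The function name `trim_stopwords_stripped` and the tiny `STOP_WORDS` set are hypothetical and used only for illustration; the notebook itself relies on sklearn's `ENGLISH_STOP_WORDS`.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook itself uses sklearn's ENGLISH_STOP_WORDS.
STOP_WORDS = frozenset({"the", "and", "a", "of", "in", "while"})

def trim_stopwords_stripped(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase, punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation only for the comparison;
        # the token itself is kept unchanged if it survives
        bare = token.strip(string.punctuation).lower()
        if bare and bare not in stop_words:
            kept.append(token)
    return " ".join(kept)

sample = "While the king and queen lie sleeping, the daughters smile."
result = trim_stopwords_stripped(sample)
print(result)
```

Whether the surviving tokens should also lose their punctuation is a separate choice; this sketch leaves them as they were.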


## Part 3 - Stemming with Textblob

In [9]:
# Helper function: lemmatize each word of a lyrics string
# NOTE: Despite the column name, this lemmatizes rather than stems;
# swap w.lemmatize() for w.stem() to apply TextBlob's stemmer instead
def stem_lemm_words(x):
    # Define a new textblob for the lyric string
    tb_doc = textblob.TextBlob(x)
    
    # Lemmatize each word in the lyrics
    lemm_list = [w.lemmatize() for w in tb_doc.words]
    
    # Join the words at spaces, return a string
    return " ".join(lemm_list)

# Store the processed lyrics in a new column
df["Lyrics_Stemmed"] = df["Lyrics"].apply(stem_lemm_words)

In [10]:
df.head()

Unnamed: 0,Title,Artist,Year,Lyrics,Lyrics_Stemmed
0,the-battle,blood-sweat-tears,1970,king queen lie sleeping daughters smile nice B...,king queen lie sleeping daughter smile nice Br...
1,hey-jude,count-basie,1970,Hey Jude don't make bad sad song make better R...,Hey Jude do n't make bad sad song make better ...
2,time,david-bowie,1973,Time he's waiting wings speaks senseless thing...,Time he 's waiting wing speaks senseless thing...
3,we-can-make-the-world-a-whole-lot-brighter,the-brady-bunch,1972,Birds flying high search clear blue sky they'r...,Birds flying high search clear blue sky they '...
4,day-by-day,carmen-mcrae,1972,Day day I'm falling love day day love grow isn...,Day day I 'm falling love day day love grow is...


## Part 4 - Count/TF-IDF Vectorization

### CountVectorizer

In [22]:
# Various ways to achieve this
from sklearn.feature_extraction.text import CountVectorizer
# OPTIONAL: Normalize the matrix dataframe
from sklearn.preprocessing import normalize

# Initialize a bag-of-words vectorizer that splits on whitespace
bag_of_words = CountVectorizer(tokenizer = lambda txt: txt.split())
docterm = bag_of_words.fit_transform(df["Lyrics"])

# Add the L2-normalized document terms to a DataFrame
# (get_feature_names_out requires scikit-learn >= 1.0)
count_vecs = pd.DataFrame(normalize(docterm.toarray()),
             columns = bag_of_words.get_feature_names_out(),
             index = df["Title"])

In [21]:
count_vecs.head()

Unnamed: 0_level_0,!,"!!!""","""","""'cause","""....?""","""...and","""50","""?jacky?","""a","""a""",...,ïïî³îµî¹,ïî­ïî¹,ïî±î¸îµî¯,ïïï,ïî­î¼î±,ïïî±î½,ïïî¹,öfter,über,überreden
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
the-battle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hey-jude,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
time,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
we-can-make-the-world-a-whole-lot-brighter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
day-by-day,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
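
To make the count-vectorization step concrete, here is a small pure-Python sketch of what it computes: one column per vocabulary term, one row per document, and each row scaled to unit length. The three toy documents and the helper names `count_vector` and `l2_normalize` are made up for illustration and are not part of the dataset or of sklearn.

```python
import math
from collections import Counter

# Toy documents, invented for this illustration
docs = ["love me love", "love you", "rain rain rain"]

# Vocabulary: sorted unique whitespace tokens, one column per term
vocab = sorted({tok for doc in docs for tok in doc.split()})

def count_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

count_rows = [count_vector(d) for d in docs]
unit_rows = [l2_normalize(r) for r in count_rows]

print(vocab)
print(count_rows)
```

Most entries are zero because each song uses only a small part of the full vocabulary, which is why the first rows of the real table show nothing but `0.0`.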


### TfidfVectorizer

In [13]:
# Various ways to achieve this
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize lyrics with TF-IDF (rows are L2-normalized by default)
bag_of_words = TfidfVectorizer(tokenizer = lambda txt: txt.split())
docterm = bag_of_words.fit_transform(df["Lyrics"])

# Add document terms to a DataFrame
# (get_feature_names_out requires scikit-learn >= 1.0)
tfidf_vecs = pd.DataFrame(docterm.toarray(), columns = bag_of_words.get_feature_names_out(), index = df["Title"])

In [14]:
tfidf_vecs.head()

Unnamed: 0_level_0,!,"!!!""","""","""'cause","""....?""","""...and","""50","""?jacky?","""a","""a""",...,ïïî³îµî¹,ïî­ïî¹,ïî±î¸îµî¯,ïïï,ïî­î¼î±,ïïî±î½,ïïî¹,öfter,über,überreden
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
the-battle,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hey-jude,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
time,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
we-can-make-the-world-a-whole-lot-brighter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
day-by-day,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
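
The following pure-Python sketch illustrates the TF-IDF weighting on toy data. It assumes the smoothed formula that sklearn's `TfidfVectorizer` uses with default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy documents and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy documents, invented for this illustration
docs = ["love me love", "love you", "rain rain rain"]
vocab = sorted({tok for doc in docs for tok in doc.split()})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in d.split() for d in docs) for t in vocab}

# Smoothed inverse document frequency (assumed to match sklearn's defaults)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Weight raw term counts by IDF, then scale the row to unit length."""
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

tfidf_rows = [tfidf_vector(d) for d in docs]
print(idf)
```

A term that occurs in many songs, like `love` here, gets a lower IDF than a term confined to one song, so shared common words contribute less to similarity than they do with plain counts.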


## Part 5 - Cosine Similarity
### CountVectorizer

In [15]:
# Reference "Favorite Song" row (appended last, at position 4500)
ref = count_vecs.iloc[4500]

# Rows are L2-normalized, so the dot product with ref equals the cosine similarity
cosines = count_vecs.dot(ref)

# Show the top 6: the reference song itself (similarity 1.0) plus its 5 closest songs
cosines.nlargest(6)

Title
taro                   1.000000
blue-eyes              0.195815
breathe                0.185933
space-age-love-song    0.161608
NA                     0.156716
rain-eyes              0.149050
dtype: float64
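
Cosine similarity is the dot product of two vectors divided by the product of their lengths. When both vectors already have unit length, the denominator is 1 and a plain dot product gives the same value, which is why `.dot(ref)` suffices above. A minimal sketch with made-up vectors (the helper names `cosine_similarity` and `unit` are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def unit(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Made-up raw count vectors for two short documents
a = [2, 1, 0, 0]
b = [1, 0, 0, 1]

full = cosine_similarity(a, b)
shortcut = sum(x * y for x, y in zip(unit(a), unit(b)))
print(full, shortcut)
```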

### TfidfVectorizer

In [16]:
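# Reference "Favorite Song" row (appended last, at position 4500)
ref = tfidf_vecs.iloc[4500]

# TF-IDF rows are L2-normalized by default, so the dot product equals the cosine similarity
cosines = tfidf_vecs.dot(ref)

# Show the top 6: the reference song itself (similarity 1.0) plus its 5 closest songs
cosines.nlargest(6)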

Title
taro                   1.000000
a-violent-death        0.072736
blue-eyes              0.061356
die-in-your-arms       0.055682
from-rags-to-riches    0.053660
pig                    0.050556
dtype: float64
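
The reference song always ranks first with a similarity of 1.0, so the cells above request six rows to obtain five recommendations. One alternative, sketched here, is to skip the reference row by position before ranking, which also stays correct if another song shares its title. The helper name `top_n_similar` is hypothetical; the scores are the six TF-IDF values printed above, listed in an arbitrary order.

```python
import heapq

def top_n_similar(scores, ref_pos, n=5):
    """Return the n highest (title, score) pairs, skipping the reference row by position."""
    candidates = [pair for pos, pair in enumerate(scores) if pos != ref_pos]
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# (title, cosine similarity) pairs copied from the TF-IDF output above
scores = [
    ("pig", 0.050556),
    ("blue-eyes", 0.061356),
    ("from-rags-to-riches", 0.053660),
    ("a-violent-death", 0.072736),
    ("die-in-your-arms", 0.055682),
    ("taro", 1.000000),  # the reference song itself
]

top3 = top_n_similar(scores, ref_pos=5, n=3)
print(top3)
```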

## Part 7

Enter your opinions here, in no more than 50 words.