# Text Similarity Measures Exercises #

## Introduction ##

We will be using [a song lyric dataset from Kaggle](https://www.kaggle.com/mousehead/songlyrics) to identify songs with similar lyrics. The data set contains artists, songs and lyrics for 55K+ songs, but today we will be focusing on songs by one group in particular - The Beatles.

The following code will help you load in the data and get set up for this exercise.

In [1]:
import nltk
import pandas as pd

In [2]:
data = pd.read_csv('../data/songdata.csv')
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


## Question 1 ##

* Filter the lyrics data set to only select songs by The Beatles.
* How many songs are there in total by The Beatles?
* Take a look at the first song's lyrics.

In [3]:
# Only look at songs by The Beatles
beatles = data[data.artist=='The Beatles']
beatles.head()

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \nAnd yo..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \nEndless rain int...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \nAll I gott..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \nThat's all I do \nA...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...


In [4]:
# Check the number of Beatles songs in the data set
data.artist.value_counts().loc[['The Beatles']]

The Beatles    178
Name: artist, dtype: int64

In [5]:
# Take a look at the first song's lyrics
beatles.iloc[0,3]

"Well, if your hands start a-clappin'  \nAnd your fingers start a-poppin'  \nAnd your feet start a-movin' around  \nAnd if you start to swing and sway  \n  \nWhen the band starts to play  \nA real cool way out sound  \nAnd if you get to can't help it and you can't sit down  \nYou feel like you gotta move around  \n  \nYou get a shot of rhythm and blues.  \nWith just a little rock and roll on the side  \nJust for good measure.  \nGet a pair of dancin' shoes  \n  \nWell, with your lover by your side  \nDon't you know you're gonna have a rockin' time, see'mon!  \nDon't you worry 'bout a thing  \nIf you start to dance and sing  \n  \nAnd chills come up on you  \nAnd if the rhythm finally gets you and the beat gets you too  \nWell, here's something for you to do  \n  \nGet a shot of rhythm and blues  \nWith just a little rock and roll on the side  \nJust for good measure  \nGet a pair of dancin' shoes  \n  \nWell, with your lover by your side  \nDon't you know you're gonna have a rockin' ti

## Question 2 ##

Apply the following preprocessing steps:
* Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.
* Remove all words with numbers using regular expressions.
* Create a document-term matrix using Count Vectorizer, with each row as a song and each column as a word in the lyrics. Have the Count Vectorizer remove all stop words as well.

Note: Count Vectorizer automatically removes punctuation and makes all characters lowercase.

In [6]:
# Remove characters from the lyrics
import re

newline = lambda x: re.sub('\n', ' ', x) # remove \n
alphanumeric = lambda x: re.sub('\w*\d\w*', ' ', x) # remove alphanumeric words

corpus = beatles.text.map(newline).map(alphanumeric)
corpus.head()

1198    Well, if your hands start a-clappin'   And you...
1199    Words are flowing out like   Endless rain into...
1200    Whenever I want you around, yeah   All I gotta...
1201    I give her all my love   That's all I do   And...
1202    You tell me that you've got everything you wan...
Name: text, dtype: object

In [7]:
# Create a document term matrix using Count Vectorizer with the stop words turned on to English
from sklearn.feature_extraction.text import CountVectorizer
          
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()

dt = pd.DataFrame(X, columns=cv.get_feature_names()).set_index(beatles.song)
dt.head()

Unnamed: 0_level_0,aaahhh,aah,abc,aches,aching,acquainted,act,actors,acts,add,...,yes,yesterday,yoko,young,younger,youre,youu,zealand,zoo,zu
song,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A Shot Of Rhythm And Blues,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Across The Universe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
All I've Got To Do,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
And I Love Her,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
And Your Bird Can Sing,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Question 3 ##

* Take a look at the lyrics for the song "Imagine".
* Which song is the most similar to the song "Imagine"?
     * Use cosine similarity to calculate the similarity
     * Use Count Vectorizer to numerically encode the lyrics
* Find the most similar song using the TF-IDF Vectorizer.

Compare the most similar song of the outputs of both the Count Vectorizer and the TF-IDF Vectorizer.

In [8]:
# Set display options so that the lyrics aren't cut off
pd.set_option('display.max_colwidth', -1)

In [9]:
# Imagine lyrics
beatles[beatles.song=='Imagine'].text

24783    Imagine there's no heaven  \nIt's easy if you try  \nNo hell below us  \nAbove us only sky  \nImagine all the people  \nLiving for today...  \n  \nImagine there's no countries  \nIt isn't hard to do  \nNothing to kill or die for  \nAnd no religion too  \nImagine all the people  \nLiving life in peace...  \n  \nYou may say I'm a dreamer  \nBut I'm not the only one  \nI hope someday you'll join us  \nAnd the world will be as one  \n  \nImagine no possessions  \nI wonder if you can  \nNo need for greed or hunger  \nA brotherhood of man  \nImagine all the people  \nSharing all the world...  \n  \nYou may say I'm a dreamer  \nBut I'm not the only one  \nI hope someday you'll join us  \nAnd the world will live as one\n\n
Name: text, dtype: object

In [10]:
# Imagine lyrics in Count Vectorizer form
imagine = list(dt.loc['Imagine'])
imagine[:20]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [11]:
# Define the cosine similarity calculation
from numpy import dot
from numpy.linalg import norm

cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

In [12]:
# Calculate all similarities and sort by the most similar
similarity = [cosine(imagine, song) for song in X]
sorted(list(zip(similarity, beatles.song)), reverse=True)[1:6]

[(0.22757944185316942, "I'll Cry Instead"),
 (0.1738270040532526, 'In My Life'),
 (0.16574838603294895, 'Eleanor Rigby'),
 (0.14902808318498439, 'In Spite Of All The Danger'),
 (0.14592261028699943, 'All My Loving')]

In [13]:
# Create the TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
          
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()

dt_tfidf = pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names()).set_index(beatles.song)
dt_tfidf.head()

Unnamed: 0_level_0,aaahhh,aah,abc,aches,aching,acquainted,act,actors,acts,add,...,yes,yesterday,yoko,young,younger,youre,youu,zealand,zoo,zu
song,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A Shot Of Rhythm And Blues,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Across The Universe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
All I've Got To Do,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.120483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
And I Love Her,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
And Your Bird Can Sing,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Calculate all similarities and sort by the most similar using the TF-IDF data
similarity_tfidf = [cosine(imagine, song) for song in X_tfidf]
sorted(list(zip(similarity_tfidf, beatles.song)), reverse=True)[1:6]

[(0.19266277925834221, "I'll Cry Instead"),
 (0.18004423848592196, "I'll Get You"),
 (0.15513717441169331, 'Eleanor Rigby'),
 (0.15054395512050198, 'In My Life'),
 (0.12306875209548619, 'Love Of The Loved')]

The answers for the Count Vectorizer and TF-IDF Vectorizer are pretty close. In both cases, "I'll Cry Instead" is the most similar song to "Imagine". When you look at the lyrics, you can see that they are both on the sad side for Beatles songs.

In [15]:
# Imagine lyrics
beatles[beatles.song=='Imagine'].text

24783    Imagine there's no heaven  \nIt's easy if you try  \nNo hell below us  \nAbove us only sky  \nImagine all the people  \nLiving for today...  \n  \nImagine there's no countries  \nIt isn't hard to do  \nNothing to kill or die for  \nAnd no religion too  \nImagine all the people  \nLiving life in peace...  \n  \nYou may say I'm a dreamer  \nBut I'm not the only one  \nI hope someday you'll join us  \nAnd the world will be as one  \n  \nImagine no possessions  \nI wonder if you can  \nNo need for greed or hunger  \nA brotherhood of man  \nImagine all the people  \nSharing all the world...  \n  \nYou may say I'm a dreamer  \nBut I'm not the only one  \nI hope someday you'll join us  \nAnd the world will live as one\n\n
Name: text, dtype: object

In [16]:
# I'll Cry Instead lyrics
beatles[beatles.song=='I\'ll Cry Instead'].text

24774    I've got every reason on earth to be mad  \n'Cause I just lost the only girl I had  \nIf I could get my way  \nI'd get myself locked up today  \nBut I can't, so I'll cry instead  \n  \nI've got a chip on my shoulder that's bigger that my feet  \nI can't talk to people that I meet  \nIf I could see you now  \nI'd try to make you sad somehow  \nBut I can't, so I'll cry instead  \n  \nDon't want to cry when there's people there  \nI get shy when they start to stare  \nI'm gonna hide myself away  \nBut I'll come back again someday  \n  \nAnd when you do you'd better hide all the girls  \nI'm gonna break their hearts all round the world  \nYes, I'm gonna break them in two  \nAnd show you what your lovin' man can do  \nUntil then I'll cry instead\n\n
Name: text, dtype: object

## Question 4 ##

Which two Beatles songs are the most similar?
   * Using Count Vectorizer
   * Using TF-IDF Vectorizer
     
Compare the results. Which Vectorizer seems to do a better job?

In [17]:
# Calculate the cosine similarity between all combinations of documents

In [18]:
# List all the combinations of songs
from itertools import combinations

pairs = list(combinations(beatles.song.index, 2)) # all song index combos
pairs_0 = list(combinations(range(len(beatles)), 2)) # all index combos starting with (0,1)
song_pairs = [(beatles.song[a_index], beatles.song[b_index]) for (a_index, b_index) in pairs]
song_pairs[:5]

[('A Shot Of Rhythm And Blues', 'Across The Universe'),
 ('A Shot Of Rhythm And Blues', "All I've Got To Do"),
 ('A Shot Of Rhythm And Blues', 'And I Love Her'),
 ('A Shot Of Rhythm And Blues', 'And Your Bird Can Sing'),
 ('A Shot Of Rhythm And Blues', 'Another Girl')]

In [19]:
# Calculate the cosine similarity of the Count Vectorizer document-term matrix
results = [cosine(X[a_index], X[b_index]) for (a_index, b_index) in pairs_0]
sorted(zip(results, song_pairs), reverse=True)[:5]

[(0.85488763420475233, ('All You Need Is Love', 'Love Me Do')),
 (0.83591731886302922, ('If I Needed Someone', 'If I Needed Someone To Love')),
 (0.8058125835571367, ('And I Love Her', 'Love Me Do')),
 (0.79301103158466191, ('If I Fell', 'Love Me Do')),
 (0.74783913827394188, ('And I Love Her', 'All You Need Is Love'))]

In [20]:
# Calculate the cosine similarity of the TF-IDF Vectorizer document-term matrix
results = [cosine(X_tfidf[a_index], X_tfidf[b_index]) for (a_index, b_index) in pairs_0]
sorted(zip(results, song_pairs), reverse=True)[:5]

[(0.87170477832615245, ('If I Needed Someone', 'If I Needed Someone To Love')),
 (0.65993535683431048, ('All You Need Is Love', 'Love Me Do')),
 (0.65028425178838345, ('Have A Banana!', 'Crinsk Dee Night')),
 (0.62327973273254611, ("Don't Let Me Down", 'Let It Be')),
 (0.52753643363228631, ('If I Fell', 'Love Me Do'))]

With the Count Vectorizer, all of the top songs contain the term "love", with "All You Need Is Love" and "Love Me Do" being the most similar. This is just because The Beatles tend to use the term "love" a lot in their songs.

With the TF-IDF Vectorizer, the top similar songs are a bit more interesting. It looks like it actually picked up a duplicate song in the lyrics dataset with "If I Needed Someone" and "If I Needed Someone To Love". Also, "Have a Banana!" and "Crinsk Dee Night" are both conversations with host Brian Matthew.

In [21]:
# All You Need Is Love lyrics
beatles[beatles.song=='All You Need Is Love'].text

24695    Love, love, love, love, love, love, love, love, love.  \n  \nThere's nothing you can do that can't be done.  \nNothing you can sing that can't be sung.  \nNothing you can say, but you can learn  \nHow to play the game  \nIt's easy.  \nNothing you can make that can't be made.  \nNo one you can save that can't be saved.  \nNothing you can do, but you can learn  \nHow to be you in time  \nIt's easy.  \n  \nAll you need is love, all you need is love,  \nAll you need is love, love. Love is all you need.  \nLove, love, love, love, love, love, love, love, love.  \nAll you need is love, all you need is love,  \nAll you need is love, love. Love is all you need.  \n  \nThere's nothing you can know that isn't known.  \nNothing you can see that isn't shown.  \nThere's nowhere you can be that isn't where  \nYou're meant to be  \nIt's easy.  \n  \nAll you need is love, all you need is love,  \nAll you need is love, love. Love is all you need.  \nAll you need is love. (All together now).  \n

In [22]:
# Love Me Do lyrics
beatles[beatles.song=='Love Me Do'].text

24805    Love, love me do.  \nYou know I love you,  \nI'll always be true,  \nSo please, love me do.  \nWhoa, love me do.  \n  \nLove, love me do.  \nYou know I love you,  \nI'll always be true,  \nSo please, love me do.  \nWhoa, love me do.  \n  \nSomeone to love,  \nSomebody new.  \nSomeone to love,  \nSomeone like you.  \n  \nLove, love me do.  \nYou know I love you,  \nI'll always be true,  \nSo please, love me do.  \nWhoa, love me do.  \n  \nLove, love me do.  \nYou know I love you,  \nI'll always be true,  \nSo please, love me do.  \nWhoa, love me do.  \nYeah, love me do.  \nWhoa, oh, love me do.\n\n
Name: text, dtype: object

In [23]:
# If I Needed Someone lyrics
beatles[beatles.song=='If I Needed Someone'].text

24768    If I needed someone to love  \nYou're the one that I'd be thinking of  \nIf I needed someone  \n  \nIf I had some more time to spend  \nThen I guess I'd be with you my friend  \nIf I needed someone  \nHad you come some other day  \nThen it might not have been like this  \nBut you see now I'm too much in love  \n  \nCarve your number on my wall  \nAnd maybe you will get a call from me  \nIf I needed someone  \nAh, ah, ah, ah  \n  \nIf I had some more time to spend  \nThen I guess I'd be with you my friend  \nIf I needed someone  \nHad you come some other day  \nThen it might not have been like this  \nBut you see now I'm too much in love  \n  \nCarve your number on my wall  \nAnd maybe you will get a call from me  \nIf I needed someone  \nAh, ah\n\n
Name: text, dtype: object

In [24]:
# If I Needed Someone To Love lyrics
beatles[beatles.song=='If I Needed Someone To Love'].text

24769    If I needed someone to love  \nYoure the woman I'd be thinking of  \nIf I needed someone  \n  \nIf I had some more time to spend  \nThen I guess I'd be with you my friend  \nIf I needed someone  \n  \nHad you come some other day  \nThen it might not have been like this  \nBut you see now I'm too much in love  \n  \nCarve your number on my wall  \nAnd maybe you will get a call from me  \nIf I needed someone  \n  \nIf I had some more time to spend  \nThen I guess I'd be with you my friend  \nIf I needed someone  \n  \nHad you come some other day  \nThen it might not have been like this  \nBut you see now I'm too much in love  \n  \nCarve your number on my wall  \nAnd maybe you will get a call from me  \nIf I needed someone  \nIf I needed someone to love  \nYoure the woman I'd be thinking of  \nIf I needed someone  \n  \nIf I had some more time to spend  \nThen I guess I'd be with you my friend  \nIf I needed someone  \n  \nHad you come some other day  \nThen it might not have be

In [25]:
# Have a Banana! lyrics
beatles[beatles.song=='Have A Banana!'].text

1221    [Speech]  \n  \nBrian Matthew: Is that it? Is that the end?  \nPaul: Yeah, yeah, that's it.  \nJohn: Fade, fade!  \nBrian: Good track. Oh, well, we'll stop there, stop there, stop there.  \nJohn: What an end!  \nBrian: Quiet! All right, George.  \nJohn: Fade!  \nBrian: Hold it!  \nGeorge: Oh, thank you.  \nJohn: Fade, you silly.  \nBrian: Well, we did. We did that. Oh, no! No! We've done that bit!  \nJohn: The train comes in now.  \nBrian: We did that.  \nJohn: Yeah.  \nBrian: To pove we weren't playing the record, then, you see. 'Cause,\notherwise, there's no point in you being here, is there? Ha, ha, ha!  \nJohn: Yeah, we did that, 'cause it sounds just like it, don't it?  \nBrian: Pretty cool lot of fellows, aren't you? Here, Ringo, have a banana,\ncatch!\n\n
Name: text, dtype: object

In [26]:
# Crinsk Dee Night lyrics
beatles[beatles.song=='Crinsk Dee Night'].text

24713    [Speech]  \n  \nBrian Matthew: The next few minutes, we're in the lap of the gods and the\nhands of the Beatles. In my young days, when I was a lad, they used to have\nactors in films and now that they--  \nPaul: Yes?  \nJohn: Hey! Listen!  \nPaul: It's all changed, now, Brian. They're not doing that, no actors.  \nJohn: It's all changed, now.  \nBrian: But this is what I wonder. In those days, the actors used to say their\nbest bits were left on the cutting room floor. Did you find that?  \nJohn: No, no, no, those were the good bits in the film. You should have seen\nthe rest.  \nBrian: Yes?  \nJohn: Rubbish!  \nBrian: Was it, really?  \nJohn: Even worse, yes.  \nBrian: Who was worst?  \nJohn: Oh, Paul.  \nBrian: I see.  \nPaul: I think John was about the worst.  \nJohn: No, it was you.  \nPaul: Oh, Ringo was very good. He was. He's a good lad.  \nBrian: He was. They're saying he's a new Charlie Chaplin. Do you think that's\nright?  \nJohn: He was miming.  \nPaul: You, too, w