# Text Similarity Measures Exercises #

## Introduction ##

We will use [a song lyric dataset from Kaggle](https://www.kaggle.com/mousehead/songlyrics) to identify songs with similar lyrics. The dataset contains the artist, title, and lyrics for 55K+ songs, but today we will focus on songs by one group in particular: The Beatles.

The following code will help us load in the data and get set up for this exercise.

In [24]:
import nltk
import pandas as pd

In [25]:
data = pd.read_csv('songdata.csv')
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


## Task 1 ##

* Filter the lyrics data set to only select songs by The Beatles.
* How many songs are there in total by The Beatles?
* Take a look at the first song's lyrics.

In [26]:
beatles = data.loc[data['artist']=='The Beatles']
beatles

Unnamed: 0,artist,song,link,text
1198,The Beatles,A Shot Of Rhythm And Blues,/b/beatles/a+shot+of+rhythm+blues_20014867.html,"Well, if your hands start a-clappin' \r\nAnd ..."
1199,The Beatles,Across The Universe,/b/beatles/across+the+universe_10026507.html,Words are flowing out like \r\nEndless rain i...
1200,The Beatles,All I've Got To Do,/b/beatles/all+ive+got+to+do_10026646.html,"Whenever I want you around, yeah \r\nAll I go..."
1201,The Beatles,And I Love Her,/b/beatles/and+i+love+her_10026463.html,I give her all my love \r\nThat's all I do \...
1202,The Beatles,And Your Bird Can Sing,/b/beatles/and+your+bird+can+sing_10026364.html,You tell me that you've got everything you wan...
1203,The Beatles,Another Girl,/b/beatles/another+girl_10026200.html,"For I have got another girl, another girl \r\..."
1204,The Beatles,Any Time At All,/b/beatles/any+time+at+all_10025891.html,"Any time at all, any time at all, any time at ..."
1205,The Beatles,Ask Me Why,/b/beatles/ask+me+why_10025893.html,"I love you, 'cause you tell me things I want t..."
1206,The Beatles,"Baby, You're A Rich Man",/b/beatles/baby+youre+a+rich+man_10026560.html,How does it feel to be \r\nOne of the beautif...
1207,The Beatles,Birthday,/b/beatles/birthday_10025908.html,You say it's your birthday \r\nIt's my birthd...


In [27]:
len(beatles.index)

178

In [28]:
beatles.iloc[0]['text']

"Well, if your hands start a-clappin'  \r\nAnd your fingers start a-poppin'  \r\nAnd your feet start a-movin' around  \r\nAnd if you start to swing and sway  \r\n  \r\nWhen the band starts to play  \r\nA real cool way out sound  \r\nAnd if you get to can't help it and you can't sit down  \r\nYou feel like you gotta move around  \r\n  \r\nYou get a shot of rhythm and blues.  \r\nWith just a little rock and roll on the side  \r\nJust for good measure.  \r\nGet a pair of dancin' shoes  \r\n  \r\nWell, with your lover by your side  \r\nDon't you know you're gonna have a rockin' time, see'mon!  \r\nDon't you worry 'bout a thing  \r\nIf you start to dance and sing  \r\n  \r\nAnd chills come up on you  \r\nAnd if the rhythm finally gets you and the beat gets you too  \r\nWell, here's something for you to do  \r\n  \r\nGet a shot of rhythm and blues  \r\nWith just a little rock and roll on the side  \r\nJust for good measure  \r\nGet a pair of dancin' shoes  \r\n  \r\nWell, with your lover by 

## Task 2 ##

Apply the following preprocessing steps:
* Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.
* Remove all words with numbers using regular expressions.
* Create a document-term matrix using Count Vectorizer, with each row as a song and each column as a word in the lyrics. Have the Count Vectorizer remove all stop words as well.

Note: Count Vectorizer automatically removes punctuation and makes all characters lowercase.
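
The first two preprocessing steps can be sketched on a toy string. This is only a minimal illustration under stated assumptions: `clean_lyrics` and the sample text are hypothetical, and unlike the cells that follow, this sketch also strips `\r` and collapses the leftover whitespace.

```python
import re

def clean_lyrics(text):
    """Hypothetical helper: drop line breaks and any token containing a digit."""
    # Replace carriage returns and newlines with a space
    text = re.sub(r'[\r\n]+', ' ', text)
    # Remove every word that contains at least one digit
    text = re.sub(r'\w*\d\w*', '', text)
    # Collapse the runs of whitespace left behind by the two steps above
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample in the same "  \r\n" layout as the dataset
sample = "Love me do  \r\nYou know I love you 4ever  \r\nOne2three go"
cleaned = clean_lyrics(sample)
print(cleaned)
```

The same function could be mapped over a lyrics column before it is passed to the Count Vectorizer.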

In [29]:
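lyrics = data.loc[:, 'text']
# Note: Series.replace without regex=True only matches whole cell values,
# so this call leaves the lyrics unchanged (the '\r\n' is still in the output below).
filter_lyrics = lyrics.replace('(\r\n|\n|\r)', " ")
print(filter_lyrics)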

0        Look at her face, it's a wonderful face  \r\nA...
1        Take it easy with me, please  \r\nTouch me gen...
2        I'll never know why I had to go  \r\nWhy I had...
3        Making somebody happy is a question of give an...
4        Making somebody happy is a question of give an...
5        Well, you hoot and you holler and you make me ...
6        Down in the street they're all singing and sho...
7        Chiquitita, tell me what's wrong  \r\nYou're e...
8        I was out with the morning sun  \r\nCouldn't s...
9        I'm waitin' for you baby  \r\nI'm sitting all ...
10       Oh, my love it makes me sad.  \r\nWhy did thin...
11       You can dance, you can jive, having the time o...
12       Changing, moving in a circle  \r\nI can see yo...
13       You're so hot, teasing me  \r\nSo you're blue ...
14       Agnetha We're not the stars of a Hollywood mov...
15       I can hear how you work, practising hard  \r\n...
16       They came flying from far away, now I'm under .

In [30]:
import numpy as np
import re
# Strip the '\n' characters; the '\r' characters are left in place, as the output below shows.
replace_n = lambda x: re.sub(r'\n', '', x)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
beatles = beatles.assign(text=beatles['text'].map(replace_n))
beatles['text']


1198     Well, if your hands start a-clappin'  \rAnd yo...
1199     Words are flowing out like  \rEndless rain int...
1200     Whenever I want you around, yeah  \rAll I gott...
1201     I give her all my love  \rThat's all I do  \rA...
1202     You tell me that you've got everything you wan...
1203     For I have got another girl, another girl  \rY...
1204     Any time at all, any time at all, any time at ...
1205     I love you, 'cause you tell me things I want t...
1206     How does it feel to be  \rOne of the beautiful...
1207     You say it's your birthday  \rIt's my birthday...
1208     Blackbird singing in the dead of night  \rTake...
1209     There's a fog upon L.A.  \rAnd my friends have...
1210     You can knock me down,  \rSlap my face  \rSlan...
1211     I been told when a boy kiss a girl,  \rTake a ...
1212     Can't buy me love, love  \rCan't buy me love  ...
1213     Oh Carol, don't let him  \rSteal your heart aw...
1214     [Speech]  \r  \rBrian Matthew: But despite the.

In [31]:
# Remove every token that contains a digit (the raw string keeps the regex escapes intact)
replace_num = lambda x: re.sub(r'\w*\d\w*', '', x)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
beatles = beatles.assign(text=beatles['text'].map(replace_num))
beatles['text']


1198     Well, if your hands start a-clappin'  \rAnd yo...
1199     Words are flowing out like  \rEndless rain int...
1200     Whenever I want you around, yeah  \rAll I gott...
1201     I give her all my love  \rThat's all I do  \rA...
1202     You tell me that you've got everything you wan...
1203     For I have got another girl, another girl  \rY...
1204     Any time at all, any time at all, any time at ...
1205     I love you, 'cause you tell me things I want t...
1206     How does it feel to be  \rOne of the beautiful...
1207     You say it's your birthday  \rIt's my birthday...
1208     Blackbird singing in the dead of night  \rTake...
1209     There's a fog upon L.A.  \rAnd my friends have...
1210     You can knock me down,  \rSlap my face  \rSlan...
1211     I been told when a boy kiss a girl,  \rTake a ...
1212     Can't buy me love, love  \rCan't buy me love  ...
1213     Oh Carol, don't let him  \rSteal your heart aw...
1214     [Speech]  \r  \rBrian Matthew: But despite the.

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
cvctor = CountVectorizer(stop_words='english')
trans = cvctor.fit_transform(beatles.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
out1 = pd.DataFrame(trans.toarray(), columns=cvctor.get_feature_names_out())
out2=pd.DataFrame(beatles.song)
doc_term_matrix=pd.concat([out2.reset_index(drop=True),out1],axis=1)
doc_term_matrix

Unnamed: 0,song,aaahhh,aah,abc,aches,aching,acquainted,act,actors,acts,...,yes,yesterday,yoko,young,younger,youre,youu,zealand,zoo,zu
0,A Shot Of Rhythm And Blues,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Across The Universe,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,All I've Got To Do,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
3,And I Love Her,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,And Your Bird Can Sing,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Another Girl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Any Time At All,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Ask Me Why,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Baby, You're A Rich Man",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0
9,Birthday,0,0,0,0,0,0,0,0,0,...,3,0,0,0,0,0,0,0,0,0


## Task 3 ##

* Take a look at the lyrics for the song "Imagine".
* Which song is the most similar to the song "Imagine"?
     * Use cosine similarity to calculate the similarity
     * Use Count Vectorizer to numerically encode the lyrics
* Find the most similar song using the TF-IDF Vectorizer.

Compare the most similar song returned by the Count Vectorizer with the one returned by the TF-IDF Vectorizer.
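
To make the cosine similarity step concrete, here is a small pure-Python sketch on word counts. The `cosine` helper, the song labels, and the toy lyrics are all hypothetical; the notebook itself relies on scikit-learn's `cosine_similarity` over the vectorizer output.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    # Only words present in both songs contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy lyrics, already lowercased and stripped of stop words
query = Counter("imagine people living peace".split())
candidates = {
    "song_a": Counter("people people living life".split()),
    "song_b": Counter("yellow submarine".split()),
}

# Score every candidate against the query and keep the best match
scores = {title: cosine(query, counts) for title, counts in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A score of 0 means the two songs share no words after preprocessing, while a score near 1 means their word counts point in almost the same direction.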

In [33]:
Img = beatles.loc[beatles['song']=="Imagine"]
Img_text=Img.iloc[0]['text']
Img

Unnamed: 0,artist,song,link,text
24783,The Beatles,Imagine,/b/beatles/imagine_20254326.html,Imagine there's no heaven \rIt's easy if you ...


In [34]:
from sklearn.metrics.pairwise import cosine_similarity
train_data = beatles.loc[beatles['song']!='Imagine']
cvector = CountVectorizer(stop_words='english')
train_text = cvector.fit_transform(train_data.text)
test_text = cvector.transform(Img.text)
out_df = pd.DataFrame(cosine_similarity(test_text, train_text), columns=train_data.text.index)
out_df

Unnamed: 0,1198,1199,1200,1201,1202,1203,1204,1205,1206,1207,...,24821,24822,24823,24824,24825,24826,24827,24828,24829,24830
0,0.0,0.142125,0.070179,0.042681,0.120263,0.05222,0.131824,0.117901,0.09526,0.048791,...,0.028675,0.053262,0.0,0.0,0.008694,0.103142,0.01891,0.0,0.109589,0.084806


In [35]:
a = out_df.idxmax(axis=1)
similar_i=beatles['song'][a]
print(similar_i)
print(out_df[a])

24774    I'll Cry Instead
Name: song, dtype: object
      24774
0  0.240527


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
train_data = beatles.loc[beatles['song']!='Imagine']
tf_idf = TfidfVectorizer(stop_words='english')
train_text = tf_idf.fit_transform(train_data.text)
test_text = tf_idf.transform(Img.text)
out_idf = pd.DataFrame(cosine_similarity(test_text, train_text), columns=train_data.text.index)
out_idf

Unnamed: 0,1198,1199,1200,1201,1202,1203,1204,1205,1206,1207,...,24821,24822,24823,24824,24825,24826,24827,24828,24829,24830
0,0.0,0.082225,0.018982,0.061435,0.049112,0.041295,0.070755,0.040354,0.055601,0.012467,...,0.009923,0.037606,0.0,0.0,0.006496,0.065503,0.010212,0.0,0.033995,0.049607


In [37]:
song_2 = out_idf.idxmax(axis=1)
similar_i2=beatles['song'][song_2]
print(similar_i2)
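# Note: out_df holds the Count Vectorizer similarities, so the score printed below is
# count-based; print(out_idf[song_2]) would show the TF-IDF score instead.
print(out_df[song_2])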

24775    I'll Get You
Name: song, dtype: object
      24775
0  0.151338


## Task 4 ##

Which two Beatles songs are the most similar?
   * Using Count Vectorizer
   * Using TF-IDF Vectorizer
     
Compare the results. Which Vectorizer seems to do a better job?
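
One way to search a similarity matrix for its most similar pair is to look at every unordered pair of distinct songs, which skips the diagonal of 1.0 self-similarities. The sketch below assumes a small hand-written matrix; `most_similar_pair`, the titles, and the scores are hypothetical.

```python
from itertools import combinations

def most_similar_pair(titles, sim):
    """Return (score, title_i, title_j) for the highest off-diagonal entry."""
    # combinations() yields each pair (i, j) with i < j exactly once,
    # so the diagonal and mirrored duplicates are never considered
    i, j = max(combinations(range(len(titles)), 2),
               key=lambda pair: sim[pair[0]][pair[1]])
    return sim[i][j], titles[i], titles[j]

titles = ["Song A", "Song B", "Song C", "Song D"]
sim = [
    [1.0, 0.2, 0.1, 0.4],
    [0.2, 1.0, 0.7, 0.3],
    [0.1, 0.7, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]

score, first, second = most_similar_pair(titles, sim)
print(score, first, second)
```

Because the function only indexes `sim[i][j]`, it should also accept the array returned by `cosine_similarity`, though that is not tested here.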

In [38]:
cvec = CountVectorizer(stop_words='english')
train_text = cvec.fit_transform(beatles.text)
test_text = cvec.transform(beatles.text)
comp = pd.DataFrame(cosine_similarity(test_text, train_text), columns=beatles.text.index)
comp
comp1=pd.DataFrame(beatles.song)
count=pd.concat([comp1.reset_index(drop=True),comp],axis=1)
count


Unnamed: 0,song,1198,1199,1200,1201,1202,1203,1204,1205,1206,...,24821,24822,24823,24824,24825,24826,24827,24828,24829,24830
0,A Shot Of Rhythm And Blues,1.000000,0.079958,0.116203,0.040383,0.124134,0.090583,0.170680,0.059761,0.036482,...,0.030523,0.069293,0.089179,0.042301,0.026734,0.134186,0.022365,0.059487,0.051845,0.033434
1,Across The Universe,0.079958,1.000000,0.002534,0.043150,0.003948,0.031425,0.012526,0.018245,0.004914,...,0.007765,0.000000,0.098315,0.049896,0.032961,0.074484,0.013656,0.052969,0.004397,0.022966
2,All I've Got To Do,0.116203,0.002534,1.000000,0.016893,0.187522,0.065070,0.379888,0.166667,0.008977,...,0.042563,0.000000,0.013817,0.048262,0.028676,0.021263,0.084204,0.013825,0.156638,0.107229
3,And I Love Her,0.040383,0.043150,0.016893,1.000000,0.000000,0.139669,0.066805,0.459501,0.014559,...,0.126550,0.000000,0.033612,0.026090,0.059295,0.031035,0.007587,0.039237,0.039082,0.470665
4,And Your Bird Can Sing,0.124134,0.003948,0.187522,0.000000,1.000000,0.232555,0.306604,0.242336,0.018647,...,0.081042,0.068423,0.064576,0.091893,0.008935,0.053000,0.014575,0.043075,0.212737,0.123471
5,Another Girl,0.090583,0.031425,0.065070,0.139669,0.232555,1.000000,0.151363,0.165353,0.019793,...,0.050829,0.108936,0.079965,0.088669,0.073496,0.112509,0.036098,0.060960,0.016603,0.204288
6,Any Time At All,0.170680,0.012526,0.379888,0.066805,0.306604,0.151363,1.000000,0.285602,0.020709,...,0.074806,0.173683,0.327838,0.063617,0.093554,0.071471,0.000000,0.022780,0.246181,0.175145
7,Ask Me Why,0.059761,0.018245,0.166667,0.459501,0.242336,0.165353,0.285602,1.000000,0.039500,...,0.204302,0.052705,0.132644,0.083654,0.034411,0.045928,0.037424,0.055300,0.289178,0.397213
8,"Baby, You're A Rich Man",0.036482,0.004914,0.008977,0.014559,0.018647,0.019793,0.020709,0.039500,1.000000,...,0.097818,0.000000,0.026793,0.017331,0.100091,0.285870,0.018142,0.017872,0.020768,0.204915
9,Birthday,0.077903,0.077076,0.072421,0.009787,0.025072,0.033264,0.079552,0.067593,0.013003,...,0.000000,0.091606,0.048031,0.037282,0.029905,0.044348,0.016262,0.068083,0.000000,0.048619


In [39]:
counter = count
position = []
value = 0
# Column 0 holds the song title, so column j corresponds to song j-1.
# Note: range(1, i-1) stops before columns i-1 and i, so a song is never
# compared with the two songs listed just before it.
for i in range(2, len(counter)):
    for j in range(1, i-1):
        if counter.iat[i, j] > value:
            position = [[i, j]]
            value = counter.iat[i, j]
        elif counter.iat[i, j] == value:
            position.append([i, j])

print(value, position)
for pos in position:
    print("First song: " + counter.iat[pos[0], 0] + " , Second Song: " + counter.iat[pos[1]-1, 0])

0.8548876342047521 [[152, 43]]
First song: Love Me Do , Second Song: All You Need Is Love


In [40]:
tf = TfidfVectorizer(stop_words='english')
train_text = tf.fit_transform(beatles.text)
test_text = tf.transform(beatles.text)
cosim = pd.DataFrame(cosine_similarity(test_text, train_text), columns=beatles.text.index)
cosim
cosim1=pd.DataFrame(beatles.song)
cosimv=pd.concat([cosim1.reset_index(drop=True),cosim],axis=1)
cosimv

Unnamed: 0,song,1198,1199,1200,1201,1202,1203,1204,1205,1206,...,24821,24822,24823,24824,24825,24826,24827,24828,24829,24830
0,A Shot Of Rhythm And Blues,1.000000,0.041549,0.065782,0.027494,0.076393,0.030683,0.087673,0.016675,0.017948,...,0.005914,0.026600,0.021247,0.009919,0.008978,0.047492,0.015133,0.023512,0.011493,0.025922
1,Across The Universe,0.041549,1.000000,0.002608,0.013421,0.004846,0.025213,0.007837,0.004934,0.003358,...,0.005511,0.000000,0.101910,0.012297,0.011046,0.041498,0.005296,0.022778,0.001831,0.006133
2,All I've Got To Do,0.065782,0.002608,1.000000,0.018775,0.073659,0.035255,0.430144,0.093243,0.003606,...,0.018450,0.000000,0.004005,0.017127,0.009870,0.007646,0.060386,0.004646,0.051496,0.043936
3,And I Love Her,0.027494,0.013421,0.018775,1.000000,0.000000,0.062955,0.056362,0.189572,0.004324,...,0.071442,0.000000,0.009511,0.009683,0.033542,0.010412,0.007878,0.017174,0.012203,0.177631
4,And Your Bird Can Sing,0.076393,0.004846,0.073659,0.000000,1.000000,0.114279,0.143996,0.106080,0.006311,...,0.032393,0.077484,0.020092,0.050020,0.002545,0.014686,0.008479,0.016402,0.069649,0.050928
5,Another Girl,0.030683,0.025213,0.035255,0.062955,0.114279,1.000000,0.103960,0.085938,0.008866,...,0.019979,0.060808,0.032360,0.038049,0.044164,0.074551,0.019314,0.030404,0.006338,0.116655
6,Any Time At All,0.087673,0.007837,0.430144,0.056362,0.143996,0.103960,1.000000,0.176504,0.012243,...,0.031357,0.070884,0.142615,0.055332,0.052467,0.035186,0.000000,0.009125,0.126481,0.104320
7,Ask Me Why,0.016675,0.004934,0.093243,0.189572,0.106080,0.085938,0.176504,1.000000,0.012723,...,0.096553,0.020310,0.064591,0.046265,0.015428,0.013399,0.031959,0.030629,0.162041,0.150630
8,"Baby, You're A Rich Man",0.017948,0.003358,0.003606,0.004324,0.006311,0.008866,0.012243,0.012723,1.000000,...,0.044802,0.000000,0.006229,0.005355,0.040426,0.156107,0.014570,0.008136,0.004713,0.119196
9,Birthday,0.038510,0.034683,0.032288,0.003466,0.008052,0.013499,0.030619,0.025522,0.008958,...,0.000000,0.026927,0.011587,0.008478,0.006769,0.020306,0.007692,0.029434,0.000000,0.040838


In [41]:
counter2 = cosimv
position2 = []
value2 = 0
for i in range(2,len(counter2)):
    for j in range(1,i-1):
        if counter2.iat[i, j] > value2:
            position2 = []
            position2.append([i,j])
            value2 = counter2.iat[i,j]
        elif counter2.iat[i, j] == value2:
            position2.append([i,j])
        
print(value2, position2)
for po in position2:
    print("First song: " + counter2.iat[po[0], 0] + " , Second Song: " + counter2.iat[po[1]-1, 0])

0.6599353568343106 [[152, 43]]
First song: Love Me Do , Second Song: All You Need Is Love


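Both methods identify the same pair, "Love Me Do" and "All You Need Is Love", as the most similar. However, the Count Vectorizer gives a higher similarity score (about 0.855, versus about 0.660 with TF-IDF).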
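
One plausible reason for the lower TF-IDF score is that TF-IDF gives less weight to words that appear in many songs, so a pair that overlaps mainly on common words loses some similarity. The toy sketch below illustrates this under stated assumptions: the three "songs" are made up, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is written out by hand on the assumption that it matches scikit-learn's default.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = {
    "song_a": "love love me do".split(),
    "song_b": "love love need".split(),
    "song_c": "yellow submarine".split(),
}

def cosine(a, b):
    """Cosine similarity between two word-weight mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Raw term counts per song
counts = {name: Counter(words) for name, words in docs.items()}

# Document frequency: the number of songs that contain each word
df = Counter(word for words in docs.values() for word in set(words))
n_docs = len(docs)

# Smoothed inverse document frequency (assumed formula, see the lead-in)
idf = {word: math.log((1 + n_docs) / (1 + freq)) + 1 for word, freq in df.items()}

# TF-IDF weight = term count * IDF; cosine() handles the normalization
tfidf = {name: {w: c * idf[w] for w, c in cnt.items()}
         for name, cnt in counts.items()}

count_sim = cosine(counts["song_a"], counts["song_b"])
tfidf_sim = cosine(tfidf["song_a"], tfidf["song_b"])
print(round(count_sim, 3), round(tfidf_sim, 3))
```

Here the only shared word, "love", occurs in two of the three songs, so its weight drops relative to the rarer words and the TF-IDF similarity ends up below the count-based one.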