# Text Similarity Measures Exercises #

## Introduction ##

We will use [a song lyric dataset from Kaggle](https://www.kaggle.com/mousehead/songlyrics) to identify songs with similar lyrics. The dataset contains artists, songs and lyrics for 55K+ songs, but today we will focus on songs by one group in particular: The Beatles.

The following code will help you load in the data and get set up for this exercise.

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
data = pd.read_csv('C:/Users/gugha/Documents/UIC/2nd_Semester/Adv Text Analytics/Assignment1/songdata.csv')
data.head()

| | artist | song | link | text |
|---|---|---|---|---|
| 0 | ABBA | Ahe's My Kind Of Girl | /a/abba/ahes+my+kind+of+girl_20598417.html | Look at her face, it's a wonderful face \r\nA... |
| 1 | ABBA | Andante, Andante | /a/abba/andante+andante_20002708.html | Take it easy with me, please \r\nTouch me gen... |
| 2 | ABBA | As Good As New | /a/abba/as+good+as+new_20003033.html | I'll never know why I had to go \r\nWhy I had... |
| 3 | ABBA | Bang | /a/abba/bang_20598415.html | Making somebody happy is a question of give an... |
| 4 | ABBA | Bang-A-Boomerang | /a/abba/bang+a+boomerang_20002668.html | Making somebody happy is a question of give an... |


## Question 1 ##

* Filter the lyrics data set to only select songs by The Beatles.
* How many songs are there in total by The Beatles?
* Take a look at the first song's lyrics.
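
The filtering step can be sketched on a tiny, made-up DataFrame. The column names mirror the dataset, but the rows and the names `songs` and `subset` are placeholders assumed for illustration only. Taking an explicit `.copy()` of the filtered rows is one way to make later column assignments safe.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's columns (not real data)
songs = pd.DataFrame({
    "artist": ["ABBA", "The Beatles", "The Beatles"],
    "song": ["Bang", "Help!", "Yesterday"],
    "text": ["placeholder lyrics one", "placeholder lyrics two", "placeholder lyrics three"],
})

# Boolean mask: keep only rows whose artist matches exactly.
# .copy() gives an independent DataFrame, so later edits do not touch `songs`.
subset = songs[songs["artist"] == "The Beatles"].copy()

n_songs = len(subset)                  # number of matching songs
first_lyrics = subset["text"].iloc[0]  # lyrics of the first matching song

# Editing the copy leaves the original DataFrame unchanged
subset["text"] = subset["text"].str.upper()
```

Positional access with `.iloc[0]` is used because the filtered frame keeps the original row labels, so label `0` may not exist in it.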

In [17]:
beatles = data[data.artist=="The Beatles"]
print("Number of Beatles Songs:",len(beatles))

Number of Beatles Songs: 178


In [18]:
#First Song
print(beatles.text.iloc[0])

Well, if your hands start a-clappin'  
And your fingers start a-poppin'  
And your feet start a-movin' around  
And if you start to swing and sway  
  
When the band starts to play  
A real cool way out sound  
And if you get to can't help it and you can't sit down  
You feel like you gotta move around  
  
You get a shot of rhythm and blues.  
With just a little rock and roll on the side  
Just for good measure.  
Get a pair of dancin' shoes  
  
Well, with your lover by your side  
Don't you know you're gonna have a rockin' time, see'mon!  
Don't you worry 'bout a thing  
If you start to dance and sing  
  
And chills come up on you  
And if the rhythm finally gets you and the beat gets you too  
Well, here's something for you to do  
  
Get a shot of rhythm and blues  
With just a little rock and roll on the side  
Just for good measure  
Get a pair of dancin' shoes  
  
Well, with your lover by your side  
Don't you know you're gonna have a rockin' tim

## Question 2 ##

Apply the following preprocessing steps:
* Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.
* Remove all words with numbers using regular expressions.
* Create a document-term matrix using Count Vectorizer, with each row as a song and each column as a word in the lyrics. Have the Count Vectorizer remove all stop words as well.

Note: Count Vectorizer automatically removes punctuation and makes all characters lowercase.
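
As a minimal sketch of these preprocessing steps, the example below cleans two made-up lyric snippets (not taken from the dataset) and builds a small document-term matrix. The helper name `clean` and the sample strings are assumptions for illustration.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up snippets containing line breaks and tokens with digits
docs = [
    "Love me do  \nYou know I love you 2day",
    "She loves you  \nYeah, yeah, yeah 1964",
]

def clean(text):
    """Hypothetical helper: strip line breaks and digit-bearing tokens."""
    text = re.sub(r"\r?\n", " ", text)    # line breaks -> spaces
    text = re.sub(r"\S*\d\S*", "", text)  # drop any token that contains a digit
    return text

cleaned = [clean(d) for d in docs]

# stop_words="english" removes common words such as "you";
# lowercasing and punctuation stripping happen during tokenization
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(cleaned)  # rows = documents, columns = words

# vocabulary_ maps each kept word to its column index in dtm
yeah_count = dtm[1, vectorizer.vocabulary_["yeah"]]
```

Here `yeah_count` is the number of times "yeah" occurs in the second snippet, regardless of capitalization or the trailing commas.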

In [148]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
beatles = beatles.copy()
# Replace line breaks with spaces
beatles["text"] = beatles["text"].str.replace(r"\r?\n", " ", regex=True)
# Remove every whitespace-delimited token that contains a digit
beatles["text"] = beatles["text"].str.replace(r"\S*\d\S*", "", regex=True)

In [149]:
vectorizer = CountVectorizer(stop_words='english')
# tokenize, build the vocabulary and encode the lyrics in one step
dtm = vectorizer.fit_transform(beatles.text)

In [150]:
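# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_beatles = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
# reset_index aligns the song titles with the 0..n-1 row index of the matrix
dtm_beatles["song"] = beatles.song.reset_index(drop=True)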

## Question 3 ##

* Take a look at the lyrics for the song "Imagine".
* Which song is the most similar to the song "Imagine"?
     * Use cosine similarity to calculate the similarity
     * Use Count Vectorizer to numerically encode the lyrics
* Find the most similar song using the TF-IDF Vectorizer.

Compare the most similar song returned by the Count Vectorizer with the one returned by the TF-IDF Vectorizer.
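
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks that formula against `cosine_similarity` on three made-up documents, then looks up the nearest neighbour of a query document. The titles and texts are placeholders assumed for illustration, not songs from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents; "query" plays the role of the song being looked up
titles = ["query", "near", "far"]
docs = ["dream world peace", "dream world home", "guitar drum bass"]

X = CountVectorizer().fit_transform(docs)  # sparse document-term matrix

# Manual cosine similarity between the first two documents
a = X[0].toarray().ravel()
b = X[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of every document to the query document
query_idx = titles.index("query")
scores = cosine_similarity(X, X[query_idx]).ravel()
library = scores[1]

# Mask the query itself instead of dropping its row, so indices stay aligned
scores[query_idx] = -1.0
best = titles[int(np.argmax(scores))]
```

The query and "near" share two of three words, so both `manual` and `library` equal 2/3, while "far" shares none and scores 0. Swapping `CountVectorizer` for `TfidfVectorizer` changes the weights but not the lookup logic.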

In [151]:
# Split the count matrix into the "Imagine" row and all other songs
is_imagine = dtm_beatles.song == "Imagine"
counts = dtm_beatles.drop("song", axis=1)
similarity1 = cosine_similarity(counts[~is_imagine], counts[is_imagine])

In [152]:
dtm_beatles.song[dtm_beatles.song!="Imagine"].iloc[np.argmax(similarity1)]

"I'll Cry Instead"

In [153]:
vectorizer = TfidfVectorizer(stop_words='english')
# tokenize, build the vocabulary and compute TF-IDF weights in one step
dtm = vectorizer.fit_transform(beatles.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtmtf_beatles = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
dtmtf_beatles["song"] = beatles.song.reset_index(drop=True)

In [154]:
# Split the TF-IDF matrix into the "Imagine" row and all other songs
is_imagine = dtmtf_beatles.song == "Imagine"
tfidf = dtmtf_beatles.drop("song", axis=1)
similarity = cosine_similarity(tfidf[~is_imagine], tfidf[is_imagine])
dtmtf_beatles.song[~is_imagine].iloc[np.argmax(similarity)]

"I'll Get You"

## Question 4 ##

Which two Beatles songs are the most similar?
   * Using Count Vectorizer
   * Using TF-IDF Vectorizer
     
Compare the results. Which Vectorizer seems to do a better job?

In [155]:
print("Count Vectorizer similarity score:",np.max(similarity1))
print("TF-IDF Vectorizer similarity score:",np.max(similarity))

Count Vectorizer similarity score: 0.22757944185316942
TF-IDF Vectorizer similarity score: 0.15611650539481658


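* The Count Vectorizer gives the higher score (0.228 versus 0.156). Note, however, that both numbers are the maximum similarity to "Imagine" from Question 3, not a comparison across all pairs of Beatles songs. TF-IDF also down-weights words that appear in many songs, so its scores are not on the same footing as raw counts, and a higher score by itself does not show that the Count Vectorizer does a better job.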
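
The question asks for the most similar pair across all songs. One possible approach, sketched here on made-up documents, is to compute the full pairwise similarity matrix, blank out the diagonal (each document compared with itself) and take the largest remaining entry. The function name `most_similar_pair` and the toy titles and texts are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_pair(matrix, titles):
    """Hypothetical helper: return (title_a, title_b, score) for the closest pair."""
    sim = cosine_similarity(matrix)  # square matrix of all pairwise scores
    np.fill_diagonal(sim, -1.0)      # ignore each document's match with itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return titles[i], titles[j], sim[i, j]

# Made-up documents standing in for song lyrics
titles = ["s1", "s2", "s3", "s4"]
docs = [
    "hold hand hold hand love",
    "hold hand love tonight",
    "yellow submarine sea",
    "sea sky green submarine",
]

count_pair = most_similar_pair(CountVectorizer().fit_transform(docs), titles)
tfidf_pair = most_similar_pair(TfidfVectorizer().fit_transform(docs), titles)
```

On this toy corpus both vectorizers pick the same pair, but the TF-IDF score is lower because of its different weighting, which illustrates why raw scores from the two encodings should not be compared directly. With the notebook's data, the same helper could be called as `most_similar_pair(dtm_beatles.drop("song", axis=1), dtm_beatles.song.tolist())`, assuming `dtm_beatles` is built as in Question 2.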