# Text Retrieval
There are two standard models for retrieving text data:
1. Boolean Retrieval Model
2. Vector Space Model

The aim of any information retrieval model is to retrieve documents related to a query.

## 1. Boolean Retrieval Model
In this model, every query and document is treated as a set of words, and a document is retrieved if and only if the query word is present in it. The model can be extended to support complex queries with the boolean operators AND, OR and NOT.
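
As a rough illustration of the set view described above, the sketch below builds word sets for a tiny made-up corpus and answers AND, OR and NOT queries with set operations. The documents, their names and the helper `containing` are hypothetical and are not part of the assignment or the lyrics dataset.

```python
# Toy corpus: hypothetical documents, not taken from the lyrics dataset.
docs = {
    "doc_a": "the quick brown fox",
    "doc_b": "the lazy dog",
    "doc_c": "quick dog tricks",
}

# Represent every document as a set of lowercase words.
doc_sets = {name: set(text.lower().split()) for name, text in docs.items()}

def containing(word):
    """Return the names of all documents whose word set contains `word`."""
    return {name for name, words in doc_sets.items() if word in words}

quick_and_dog = containing("quick") & containing("dog")  # AND -> set intersection
quick_or_dog = containing("quick") | containing("dog")   # OR  -> set union
quick_not_dog = containing("quick") - containing("dog")  # AND NOT -> set difference

print(sorted(quick_and_dog))
print(sorted(quick_or_dog))
print(sorted(quick_not_dog))
```

The steps that follow express the same idea with a 0/1 document-term matrix instead of Python sets.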

In this assignment we implement both models using the scikit-learn package, applied to a dataset of song lyrics.


**Step 1. Import necessary packages -- numpy and pandas - 1 Mark** 

In [1]:
import numpy as np
import pandas as pd


**Step 2. Read the dataset and store it in variable 'df' - 1 mark** <br> 

The lyric column of the dataset has song lyrics. We aim to give some lyrics as a query and retrieve the song name. 


In [2]:
df=pd.read_csv('modified_song_lyrics.csv')
df.head()


| Unnamed: 0 | album | track_title | lyric | year |
|---|---|---|---|---|
| 0 | Taylor Swift | Tim McGraw | He said the way my blue eyes shined Put those ... | 2006 |
| 1 | Taylor Swift | Picture To Burn | State the obvious, I didn't get my perfect fan... | 2006 |
| 2 | Taylor Swift | Teardrops On My Guitar | Drew looks at me I fake a smile so he won't se... | 2006 |
| 3 | Taylor Swift | A Place In This World | I don't know what I want, so don't ask me Caus... | 2006 |
| 4 | Taylor Swift | Cold as You | You have a way of coming easily to me And when... | 2006 |


**Documentation Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html**<br>

**Step 3**<br>
1. Import this class
2. Create a 'vectorizer' object of 'CountVectorizer' with parameter binary=True - 1 Mark

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True)


We aim to record, for every song, the presence or absence of each word in its lyrics. <br> 
**Step 4. Fit and transform the lyric column using vectorizer - 2 Marks**<br>
The X object is a matrix of size (n_songs, n_unique_words) in which each entry is 1 if the word is present in the song and 0 otherwise. Verify the size using the X.shape attribute.

In [4]:
X = vectorizer.fit_transform(df['lyric'].tolist())
X.shape


(94, 2301)
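
To make the shape and contents of such a matrix concrete, here is a small sketch on a two-document toy corpus. The corpus and the `toy_` names are invented for illustration; only `CountVectorizer(binary=True)` comes from the step above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus; "love" occurs twice in the first document.
toy_docs = ["love love story", "sad story"]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# Recover the words in column order from the learned word -> column mapping.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_dense = toy_X.toarray()  # dense copy, only sensible for small matrices

print(toy_vocab)
print(toy_dense)
```

Because `binary=True`, the repeated word "love" is still recorded as 1 rather than 2, so the matrix encodes presence only.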

In [5]:
query1 = 'beautiful'
query2 = 'girl'
# To find all documents containing a word, select that word's column of X
list_q1 = X[:,vectorizer.vocabulary_[query1]]
# Step 5. Do the same for 'query2' and store it in 'list_q2'
list_q2 = X[:,vectorizer.vocabulary_[query2]]
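
Looking a word up directly in `vocabulary_` raises a `KeyError` when the word never occurs in the corpus. One possible way to guard against that is sketched below on a toy corpus; the helper `docs_with` and the toy data are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(["beautiful girl", "beautiful day"])

def docs_with(word):
    """Return a 0/1 array marking the documents that contain `word`.

    Words that are not in the vocabulary yield an all-zero array
    instead of raising a KeyError.
    """
    col = toy_vectorizer.vocabulary_.get(word.lower())
    if col is None:
        return np.zeros(toy_X.shape[0], dtype=int)
    return toy_X[:, col].toarray().ravel()

print(docs_with("girl"))
print(docs_with("unicorn"))
```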

In [6]:
# AND Operation
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 and list_q2[i]==1:
        print(df['track_title'].iloc[i])  # select the title by column name rather than by position

Teardrops On My Guitar
Superman
End Game (Ft. Ed Sheeran & Future)


**Step 6. Implement OR operation - 1 Mark**

In [7]:
# OR Operation
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 or list_q2[i]==1:
        print(df['track_title'].iloc[i])  # select the title by column name rather than by position


Teardrops On My Guitar
A Place In This World
Stay Beautiful
Mary's Song (Oh My My My)
I'm Only Me When I'm With You
Invisible
Fifteen
Hey Stephen
White Horse
You Belong With Me
The Way I Loved You
Back To December
Speak Now
Dear John
Innocent
Last Kiss
Superman
Holy Ground
Sad Beautiful Tragic
Everything Has Changed (Ft. Ed Sheeran)
Begin Again
Girl at Home
Blank Space
Style
How You Get The Girl
End Game (Ft. Ed Sheeran & Future)
So It Goes...
King of My Heart
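
The row-by-row loops above can also be written with boolean masks. The sketch below uses hand-made 0/1 arrays in place of the real columns; with the notebook's sparse matrix, a column could presumably be densified first with `.toarray().ravel()`. All names and values here are hypothetical.

```python
import numpy as np

# Hypothetical presence flags for two query words across five documents.
has_q1 = np.array([1, 0, 1, 0, 1], dtype=bool)
has_q2 = np.array([1, 1, 0, 0, 1], dtype=bool)
titles = np.array(["A", "B", "C", "D", "E"])

and_hits = titles[has_q1 & has_q2]   # documents containing both words
or_hits = titles[has_q1 | has_q2]    # documents containing at least one word
not_hits = titles[has_q1 & ~has_q2]  # documents with the first word but not the second

print(and_hits, or_hits, not_hits)
```

Element-wise `&`, `|` and `~` replace Python's `and`, `or` and `not`, which do not operate on whole arrays.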


## 2. Vector Space Model
In this model, every document and query is represented as a vector, and the document whose vector is closest to the query vector, i.e. has the highest cosine similarity (smallest cosine distance), is returned as the answer.
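
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following minimal sketch computes it with NumPy on made-up vectors; `cosine_sim` is a hypothetical helper and assumes neither vector is all zeros.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query_v = np.array([1.0, 1.0, 0.0])
doc_1 = np.array([2.0, 2.0, 0.0])  # same direction as the query, different length
doc_2 = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with the query

sim_1 = cosine_sim(query_v, doc_1)
sim_2 = cosine_sim(query_v, doc_2)
print(sim_1, sim_2)
```

The measure depends only on direction, so a long document is not favoured merely for containing more words.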

**Documentation Reference:**<br>
1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

**Step 1. Import above references - 1 Mark**

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

**Step 2. Create a 'vectorizer' object of 'TfidfVectorizer' - 1 Mark**

In [9]:
vectorizer = TfidfVectorizer()


Here we calculate the tf-idf score of every term in every song's lyrics. <br> 
**Step 3. Fit and transform the lyric column using vectorizer - 2 Marks**<br>
The X object is a matrix of size (n_songs, n_unique_words) in which each entry is the tf-idf score of the word in the song. Verify the size using the X.shape attribute.

In [10]:
X = vectorizer.fit_transform(df['lyric'].tolist())
X.shape


(94, 2301)
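
To see what the tf-idf scores express, the sketch below fits a `TfidfVectorizer` with default settings on a three-document toy corpus in which "story" appears in every document and "love" in only one. The corpus and the `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "story" is in all documents, "love" in only the first.
toy_docs = ["love story", "sad story", "story time"]

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_docs).toarray()

col = toy_tfidf.vocabulary_           # word -> column index
w_love = toy_X[0, col["love"]]        # weight of the rare word in document 0
w_story = toy_X[0, col["story"]]      # weight of the common word in document 0
row_norm_sq = (toy_X[0] ** 2).sum()   # rows are L2-normalised by default

print(w_love, w_story, row_norm_sq)
```

Both words occur once in the first document, yet the rarer word receives the larger weight, which is the effect of the inverse document frequency.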

**Step 4. Use 'transform' method of vectorizer on 'query' and store in 'query_vec' - 1 Mark**<br>
This method converts text into a tf-idf vector using the vocabulary and idf weights learned in Step 3.

In [11]:
query = "Take it easy, with me"
query_vec = vectorizer.transform([query])
query_vec

<1x2301 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

**Step 5. Use 'cosine_similarity' on 'X' and 'query_vec' store it in 'results' - 1 Mark**

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
results = cosine_similarity(X, query_vec)


In [13]:
# Print the details of the song most similar to the query
song_index = np.argmax(results.reshape((-1,)))  # row of X with the highest cosine similarity
print('Song Index -- ',song_index)
print('Title -- ',df['track_title'].iloc[song_index])
print('Album -- ',df['album'].iloc[song_index])
print('Lyrics -- ',df['lyric'].iloc[song_index])

Song Index --  20
Title --  Breathe (Ft. Colbie Caillat)
Album --  Fearless
Lyrics --  I see your face in my mind as I drive away 'Cause none of us thought it was Going to end that way People are people And sometimes we change our minds But it's killing me to see you go after all this time Mm mm mm, mm mm mm, mm mm Mm mm mm, mm mm mm, mm mm Music starts playing like the end of a sad movie It's the kind of ending you Don't really want to see 'Cause it's tragedy and it'll only bring you Down Now I don't know what to be without you around And we know it's never simple, never easy Never a clean break, no one here to Save me You're the only thing I know like the back of my hand And I can't Breathe Without you but I have to breathe Without you but I have to Never wanted this, never want to see you hurt Every little bump in the road I tried to swerve But people are people And sometimes it doesn't work out Nothing we say is gonna save us from the fall out It's 2 A.M Feeling like I just lost a 
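
Taking only the arg-max returns a single song. One possible extension, sketched below on invented data, ranks all documents by similarity and keeps the top few. The titles, lyrics and `top_k` value are hypothetical and not part of the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and lyrics standing in for the dataset.
toy_titles = ["Song A", "Song B", "Song C"]
toy_lyrics = ["take it easy with me", "never easy never simple", "dancing in the rain"]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_lyrics)
toy_query = toy_vec.transform(["take it easy"])

# One similarity score per document, flattened to a 1-D array.
toy_scores = cosine_similarity(toy_X, toy_query).ravel()

top_k = 2
ranked = np.argsort(toy_scores)[::-1][:top_k]  # indices sorted by descending score
top_titles = [toy_titles[i] for i in ranked]
print(top_titles)
```

A document that shares no words with the query scores exactly 0, so it may be worth filtering out zero scores before reporting results.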

In [14]:
print('COMPLETE')

COMPLETE
