# NLP - HW6
### Miguel Bonilla

- [1. Evaluate Text Similarity](#1.-Evaluate-Text-Similarity)
- [2. Evaluate Using Major Text Engine](#2.-Evaluate-Using-Major-Text-Engine)

1. Evaluate text similarity of Amazon book search results by doing the following:  
a. Do a book search on Amazon via the search box. Manually copy the full book title 
(including subtitle) of each of the top 24 books listed in the first two pages of search 
results. You need to share the search query you use.  
b. In Python, run one of the text-similarity measures covered in this course, e.g., cosine 
similarity. Compare each of the book titles, pairwise, to every other one.  
c. Which two titles are the most similar to each other? Which are the most dissimilar? 
Where do they rank, among the first 24 results?  
2. Now evaluate using a major search engine.  
a. Enter one of the book titles from question 1a into Google, Bing, or Yahoo!. Copy the 
capsule of the first organic result and the 20th organic result. Take web results only (i.e.,
not video results), and skip sponsored results.  
b. Run the same text similarity calculation that you used for question 1b on each of these 
capsules in comparison to the original query (book title).  
c. Which one has the highest similarity measure?  

Submit all of your inputs and outputs and your code for this assignment, along with a brief written 
explanation of your findings. 

In [1]:
import pandas as pd
import nltk
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

### 1. Evaluate Text Similarity

The titles were copied manually from an Amazon.com book search for "Latin American Literature". The first 24 non-sponsored results were saved to a CSV file and then loaded into a pandas DataFrame.

In [22]:
titles = pd.read_csv('https://raw.githubusercontent.com/boneeyah/DS7337/main/Book_titles.csv')
titles = titles['Book title'].to_list()
titles

['Chronicle of a Death Foretold',
 'Chronicle of a Death Foretold - Teacher Guide by Novel Units',
 'Chronicle of a Death Foretold 1st (first) edition Text Only',
 'Chronicle of a Death Foretold (Chinese Edition)',
 'Chronicle of a Death Foretold (SparkNotes Literature Guide) (SparkNotes Literature Guide Series)',
 'Crónica de una muerte anunciada / Chronicle of a Death Foretold (Spanish Edition)',
 'Chronicle of a Death Foretold by Gabriel García Márquez (Book Analysis): Detailed Summary, Analysis and Reading Guide (BrightSummaries.com)',
 'Crónica de una muerte anunciada [Chronicle of a Death Foretold]',
 'GradeSaver(tm) ClassicNotes Chronicle of a Death Foretold',
 'Destiny & A Chronicle of Deaths Foretold (Books One, Two and Three)',
 'Cien años de soledad (50 Aniversario) / One Hundred Years of Solitude: Illustrated Fiftieth Anniversary edition of One Hundred Years of Solitude (Spanish Edition)',
 'A Study Guide for Gabriel Garcia Marquez\'s "Chronicle of a Death Foretold" (For St

In [23]:
# Punctuation tokens to strip from the titles after tokenization.
# Stop-word removal (English + Spanish, since some titles appear in Spanish) is left disabled:
#stop_words = nltk.corpus.stopwords.words('english') + nltk.corpus.stopwords.words('spanish')
special = [':',',','-','(',')','[',']','–','/','#','``',';','.','&','"',"''"]

In [27]:
def normalize_list(title_list):
    """Lowercase and tokenize each title, drop punctuation tokens, and rejoin with spaces."""
    normalized = []
    for title in title_list:
        tokens = word_tokenize(title.lower())  # needs the NLTK tokenizer data to be installed
        normalized.append(' '.join(w for w in tokens if w not in special))
    return np.array(normalized)

In [28]:
norm_corpus = normalize_list(titles)

In [29]:
norm_corpus

array(['chronicle of a death foretold',
       'chronicle of a death foretold teacher guide by novel units',
       'chronicle of a death foretold 1st first edition text only',
       'chronicle of a death foretold chinese edition',
       'chronicle of a death foretold sparknotes literature guide sparknotes literature guide series',
       'crónica de una muerte anunciada chronicle of a death foretold spanish edition',
       'chronicle of a death foretold by gabriel garcía márquez book analysis detailed summary analysis and reading guide brightsummaries.com',
       'crónica de una muerte anunciada chronicle of a death foretold',
       'gradesaver tm classicnotes chronicle of a death foretold',
       'destiny a chronicle of deaths foretold books one two and three',
       'cien años de soledad 50 aniversario one hundred years of solitude illustrated fiftieth anniversary edition of one hundred years of solitude spanish edition',
       "a study guide for gabriel garcia marquez 's ch

In [30]:
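# Character 1- and 2-grams built within word boundaries. min_df=1 keeps every n-gram;
# it behaves like the original min_df=0, which recent scikit-learn releases reject as an integer.
tf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, analyzer='char_wb')
tfidf_matrix = tf.fit_transform(norm_corpus)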
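
To see what `analyzer='char_wb'` with `ngram_range=(1, 2)` feeds into TF-IDF, the analyzer can be inspected directly. This is a minimal sketch on a made-up two-word string, not the notebook's corpus; `sample`, `analyze`, and `grams` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same analyzer settings as the vectorizer in the cell above
# (min_df only filters the vocabulary, it does not change the analyzer).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), analyzer='char_wb')
analyze = vectorizer.build_analyzer()

sample = 'of death'  # illustrative input, not taken from the title list
grams = analyze(sample)
print(grams)
```

Each word is padded with a space on both sides before the 1- and 2-character n-grams are extracted, so no n-gram spans two words. Titles that share words, or even word fragments, therefore end up with overlapping features.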

In [31]:
# Row/column labels Book1 ... Book24, in search-result order
index = ['Book{}'.format(i + 1) for i in range(len(titles))]

In [32]:
doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim,index=index,columns=index)
doc_sim_df.head()

| | Book1 | Book2 | Book3 | Book4 | Book5 | Book6 | Book7 | Book8 | Book9 | Book10 | ... | Book15 | Book16 | Book17 | Book18 | Book19 | Book20 | Book21 | Book22 | Book23 | Book24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Book1 | 1.0 | 0.858996 | 0.809811 | 0.898119 | 0.697976 | 0.831611 | 0.68673 | 0.842867 | 0.822993 | 0.865894 | ... | 0.697705 | 0.793894 | 0.816838 | 0.824996 | 0.946549 | 0.824996 | 0.874578 | 0.858359 | 0.870948 | 0.822371 |
| Book2 | 0.858996 | 1.0 | 0.752948 | 0.81105 | 0.748553 | 0.808989 | 0.738397 | 0.812901 | 0.798692 | 0.820105 | ... | 0.667756 | 0.77906 | 0.749092 | 0.756573 | 0.84726 | 0.756573 | 0.86314 | 0.842993 | 0.77528 | 0.81316 |
| Book3 | 0.809811 | 0.752948 | 1.0 | 0.810859 | 0.629626 | 0.744778 | 0.621437 | 0.704483 | 0.706205 | 0.774129 | ... | 0.616255 | 0.750159 | 0.758016 | 0.748221 | 0.833449 | 0.748221 | 0.773665 | 0.759215 | 0.741715 | 0.726851 |
| Book4 | 0.898119 | 0.81105 | 0.810859 | 1.0 | 0.696957 | 0.833387 | 0.659242 | 0.782577 | 0.773853 | 0.832231 | ... | 0.681862 | 0.78495 | 0.775979 | 0.783729 | 0.856036 | 0.783729 | 0.831679 | 0.820889 | 0.816846 | 0.794915 |
| Book5 | 0.697976 | 0.748553 | 0.629626 | 0.696957 | 1.0 | 0.718408 | 0.652015 | 0.677527 | 0.726433 | 0.68392 | ... | 0.59282 | 0.680363 | 0.623737 | 0.629966 | 0.696257 | 0.629966 | 0.709271 | 0.695291 | 0.683941 | 0.712227 |
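
For reference, each entry of the matrix above is the cosine similarity between the TF-IDF vectors of two titles. In standard notation, with $\mathbf{a}$ and $\mathbf{b}$ standing for the two vectors (symbols introduced here for illustration only):

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

Because TF-IDF weights are non-negative, the values fall between 0 and 1. Every title has similarity 1 with itself, which is why the diagonal is 1.0, and the matrix is symmetric because the formula does not depend on the order of $\mathbf{a}$ and $\mathbf{b}$.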


In [33]:
doc_sim_df.to_csv('matrix.csv')

In [None]:
doc_sim.flatten()[48]  # flat index 48 = row 2, column 0 of the 24x24 matrix (Book3 vs. Book1)

In [None]:
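# Flat indices sorted by descending similarity, skipping the first 24 entries.
# This assumes those 24 are the diagonal self-similarities, which does not hold if two titles are identical.
np.argsort(-doc_sim.flatten())[24:]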

In [None]:
doc_sim.flatten()
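
The flattened matrix in the cells above mixes the diagonal and both copies of every pair. One way to pick out the most and least similar titles is to look only at the upper triangle, so each pair appears once and self-similarities are excluded. This is a sketch under stated assumptions: `extreme_pairs`, `toy_sim`, and `toy_labels` are hypothetical names, and the 3x3 matrix is made-up data rather than the notebook's result.

```python
import numpy as np

def extreme_pairs(sim, labels):
    """Return (label_a, label_b, score) for the most and least similar distinct pairs."""
    sim = np.asarray(sim, dtype=float)
    # Indices strictly above the diagonal: each unordered pair once, no self-pairs.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    values = sim[rows, cols]
    hi = int(np.argmax(values))
    lo = int(np.argmin(values))
    most = (labels[rows[hi]], labels[cols[hi]], float(values[hi]))
    least = (labels[rows[lo]], labels[cols[lo]], float(values[lo]))
    return most, least

# Toy symmetric similarity matrix (illustrative values only).
toy_sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])
toy_labels = ['Book1', 'Book2', 'Book3']

most, least = extreme_pairs(toy_sim, toy_labels)
print('most similar: ', most)
print('least similar:', least)
```

With the notebook's objects, the same call would presumably be `extreme_pairs(doc_sim, index)`. Since the labels follow search-result order, the returned names also show where each title ranks among the 24 results.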