### DS7337 NLP - HW 6
#### David Wei

# Homework 6

**<u>HW 6:</u>**

1.  Evaluate text similarity of Amazon book search results by doing the following:

    - a. Do a book search on Amazon. Manually copy the full book title (including subtitle) of each of the top 24 books listed in the first two pages of search results. 
    - b. In Python, run one of the text-similarity measures covered in this course, e.g., cosine similarity. Compare each of the book titles, pairwise, to every other one. 
    - c. Which two titles are the most similar to each other? Which are the most dissimilar? Where do they rank, among the first 24 results?

2.  Now evaluate using a major search engine.
    - a. Enter one of the book titles from question 1a into Google, Bing, or Yahoo!. Copy the capsule of the first organic result and the 20th organic result. Take web results only (i.e., not video results), and skip sponsored results.
    - b. Run the same text similarity calculation that you used for question 1b on each of these capsules in comparison to the original query (book title).
    - c. Which one has the highest similarity measure?




#### Import Libs

In [1]:
# python
import os
import numpy as np
import time
import re
import pandas as pd
from tqdm import tqdm
import random
import string
import warnings
warnings.filterwarnings("ignore")
# nltk
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import ToktokTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.chunk.util import tree2conlltags,conlltags2tree

# spaCy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
# nltk corpus
from nltk.corpus import brown
# POS taggers
from textblob import TextBlob
import spacy
# viz & GUI
from IPython.display import Image
from IPython.core.display import HTML 
import matplotlib.pyplot as plt
# sklearn
from sklearn.preprocessing import minmax_scale
from sklearn.feature_extraction.text import TfidfVectorizer
# web scraping
import requests
import urllib3
from bs4 import BeautifulSoup
from string import punctuation
# from selenium import webdriver
# from selenium.webdriver.common.keys import Keys

## **Part 1**

### Collecting Top 100 "Books to Read in a Lifetime"

Source: https://amz.run/4hg5

In [2]:
books = ['1984 (Signet Classics), Book Cover May Vary', 'A Brief History of Time', 'A Heartbreaking Work of Staggering Genius', 'Diary of a Wimpy Kid, Book 1', 'Dune (Dune Chronicles, Book 1)', 'Fahrenheit 451', 'Me Talk Pretty One Day', "Middlesex: A Novel (Oprah's Book Club)", "Midnight's Children: A Novel (Modern Library 100 Best Novels)", 'The Corrections: A Novel', 'The Devil in the White City: Murder, Magic, and Madness at the Fair That Changed America', 'The Diary of a Young Girl', 'The Poisonwood Bible: A Novel', 'The Power Broker: Robert Moses and the Fall of New York', 'RIGHT STUFF', 'A Long Way Gone: Memoirs of a Boy Soldier', 'The Bad Beginning: Or, Orphans! (A Series of Unfortunate Events, Book 1)', 'A Wrinkle in Time (A Wrinkle in Time Quintet)', 'Selected Stories, 1968-1994', "Alice's Adventures in Wonderland & Through the Looking-Glass (Bantam Classics)", "All the President's Men", "Angela's Ashes: A Memoir", "Are You There God? It's Me, Margaret.", 'Bel Canto (Harper Perennial Deluxe Editions)']
print('Example Book: '+str(books[0]))
print('# of Books: '+str(len(books)))

Example Book: 1984 (Signet Classics), Book Cover May Vary
# of Books: 24


### Normalizing and Converting Book Titles into Vectors

Steps:<br>
1. We will use a TfidfVectorizer, which learns the vocabulary and IDF weights from the titles and then returns a document-term matrix. This creates a sparse (24x110) matrix with one row per book (24) and one column per unique term (110); each cell holds that term's TF-IDF weight in that title.<br>
2. The sparse matrix is then converted into a dense array of TF-IDF weights and wrapped in a dataframe to better visualize our document-term matrix.

Source: https://studymachinelearning.com/cosine-similarity-text-similarity-metric/
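
The weighting described in the steps above can be sketched by hand. This is a minimal illustration, not the notebook's method: it assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization, and tokens of two or more word characters), and the helper names `tokenize` and `tfidf_vectors` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring scikit-learn's default token pattern (assumption).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    # Returns (vocab, vectors): one L2-normalized TF-IDF row per document.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # smoothed IDF, as in scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


toy_titles = [
    "The Corrections: A Novel",
    "The Poisonwood Bible: A Novel",
    "Fahrenheit 451",
]
vocab, vectors = tfidf_vectors(toy_titles)

# Rows are unit length, so a plain dot product is already a cosine similarity.
sim_shared_terms = sum(x * y for x, y in zip(vectors[0], vectors[1]))
sim_no_shared_terms = sum(x * y for x, y in zip(vectors[0], vectors[2]))
print(vocab)
print(sim_shared_terms, sim_no_shared_terms)
```

Note that the single-character token "a" is dropped by the token pattern, so the two "A Novel" titles are linked only by "the" and "novel". Titles with no terms in common score exactly 0.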

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# build a TF-IDF vectorizer (accents stripped to ASCII, text lowercased)
vectorizer = TfidfVectorizer(strip_accents='ascii', lowercase=True)
# learn the vocabulary and IDF weights from the titles, then return the document-term matrix
tfidf_matrix = vectorizer.fit_transform(books)
print(tfidf_matrix.shape)

(24, 110)


### Visualizing TF-IDF

In [4]:
def create_dataframe(matrix, tokens):
    # doc_names = [f'book_{i+1}' for i, _ in enumerate(matrix)]
    doc_names = [books[i] for i, _ in enumerate(matrix)]
    df = pd.DataFrame(data=matrix, index=doc_names, columns=tokens)
    return df

# convert document-term matrix to array 
tfidf_array = tfidf_matrix.toarray()

# get the vocabulary terms in column order
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it instead)
tokens = vectorizer.get_feature_names_out()

df = create_dataframe(tfidf_array,tokens)
df.head(1)

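| | 100 | 1968 | 1984 | 1994 | 451 | adventures | alice | all | america | and | ... | vary | way | white | wimpy | wonderland | work | wrinkle | york | you | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1984 (Signet Classics), Book Cover May Vary | 0.0 | 0.0 | 0.399772 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.399772 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |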


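### Calculating Linear Kernel (Cosine) Similarity

Steps:
1. Scikit-learn's `linear_kernel` computes plain dot products. Because TfidfVectorizer L2-normalizes each row by default, those dot products are equal to cosine similarity, and `linear_kernel` is faster than `cosine_similarity` because it skips the normalization step.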

Source: https://scikit-learn.org/stable/modules/metrics.html
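
The equivalence between the linear kernel and cosine similarity holds only for unit-length vectors. A small sketch with made-up vectors (the values and the helper name `cosine_similarity_manual` are illustrative assumptions, not part of the notebook) shows why:

```python
import numpy as np


def cosine_similarity_manual(a, b):
    # cosine similarity: dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# two arbitrary, non-normalized vectors (illustrative values)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# L2-normalize each vector, as TfidfVectorizer does to each row by default
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))                 # linear kernel on raw vectors
dot_of_units = float(np.dot(a_unit, b_unit))  # linear kernel on unit vectors
cos = cosine_similarity_manual(a, b)

print(raw_dot, dot_of_units, cos)
```

Only the dot product of the normalized vectors matches the cosine value. If the vectorizer were configured with `norm=None`, the linear kernel would no longer be a cosine similarity.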

In [5]:
from sklearn.metrics.pairwise import linear_kernel

linear_kernel_similarity_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

doc_names = [books[i] for i, _ in enumerate(linear_kernel_similarity_matrix)]

#### Visualizing Linear Kernel Similarity Matrix

In [6]:
similarity_df = pd.DataFrame(linear_kernel_similarity_matrix)
similarity_df.head(5)

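| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0995 | 0.080974 | 0.0 | 0.0 | 0.09447 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.066277 | 0.0 | 0.0 | 0.105806 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.099242 | 0.108936 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.082193 | 0.072562 | 0.269569 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.099242 | 1.0 | 0.092528 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.069813 | 0.061633 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0995 | 0.108936 | 0.092528 | 1.0 | 0.10638 | 0.0 | 0.0 | 0.12411 | 0.0 | 0.0 | ... | 0.0 | 0.076633 | 0.154725 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.080974 | 0.0 | 0.0 | 0.10638 | 1.0 | 0.0 | 0.0 | 0.101003 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.07086 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |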


## Finding Most Similar and Dissimilar Books

Once we have the similarity matrix, which is a 24×24 matrix (each title compared with every title, including itself), we need to do five things:
* 1. Set the self-similarity entries (the diagonal) to NaN
* 2. Create a list of the NaN positions in the flattened (1-D) similarity matrix
* 3. Create a sorted index of the similarity matrix that excludes the NaN positions
* 4. Use the sorted index to look up the similarity values in the flattened matrix
* 5. Use the similarity indexes and values to return the highest and lowest similarities, along with each title's position (0-based) in the original list of 24

Source: https://stackoverflow.com/questions/54437769/how-to-rank-values-in-a-dataframe-with-indexes
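
The same ranking can be sketched more compactly by looking only at the upper triangle of the matrix, which skips the diagonal and counts each pair once. This is an alternative illustration rather than the approach used in this notebook; the helper name `rank_pairs` and the 3×3 toy matrix are assumptions.

```python
import numpy as np


def rank_pairs(sim):
    # Return (row, col, score) for every unique pair, sorted by ascending score.
    sim = np.asarray(sim, dtype=float)
    # indices strictly above the diagonal: each pair once, no self-comparisons
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    order = np.argsort(scores, kind="stable")
    return [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]


# toy symmetric similarity matrix (illustrative values)
toy = np.array([
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

pairs = rank_pairs(toy)
least_similar = pairs[0]
most_similar = pairs[-1]
# every pair tied at the lowest score; with short titles many pairs share no terms
tied_lowest = [p for p in pairs if p[2] == least_similar[2]]

print(most_similar, least_similar, len(tied_lowest))
```

Listing the ties matters here: when many title pairs score exactly 0.0, the single "most dissimilar" pair returned by a sort is just the first of the tied pairs.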

In [7]:
# pd.set_option("display.max_rows", None, "display.max_columns", None)
# pd.reset_option('all')

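##################### Converting self-similarity entries to NaN #####################
# stack to a (row, col)-indexed Series, drop the diagonal (row == col),
# then unstack: the dropped diagonal cells come back as NaN
similarity_df_stacked = similarity_df.stack()
similarity_df_stacked = similarity_df_stacked[similarity_df_stacked.index.get_level_values(0) != similarity_df_stacked.index.get_level_values(1)]
similarity_matrix_flatten = similarity_df_stacked.unstack().to_numpy().flatten()
# argsort orders the flat positions by ascending similarity; NaN positions sort last
similarity_matrix_sorted = similarity_matrix_flatten.argsort()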

##################### Collecting NaN positions in the flattened matrix #####################
self_indexes = []
for num, i in enumerate(similarity_matrix_flatten):
    if np.isnan(i):
        self_indexes.append(num)
print(f"NAN Indices: {self_indexes}\n")

##################### Dropping NaN positions from the sorted index #####################
relevant_similarity_matrix = [i for i in similarity_matrix_sorted if i not in self_indexes]
print(f"Sorted Matrix Count: {len(similarity_matrix_sorted)}")
print(f"Subbed Sorted Matrix Count: {len(relevant_similarity_matrix)}")
print(f"Subbed Count Validation: {len(similarity_matrix_sorted) - len(relevant_similarity_matrix)}\n")

##################### Highest and Lowest Similarity Values #####################
high_similarity_index = relevant_similarity_matrix[-1]
low_similarity_index = relevant_similarity_matrix[0]

high_similarity_val = similarity_matrix_flatten[high_similarity_index]
low_similarity_val = similarity_matrix_flatten[low_similarity_index]
print(f'Highest Similarity Values: {high_similarity_val}')
print(f'Lowest Similarity Values: {low_similarity_val}\n')

##################### Highest and Lowest Similarity Book Titles with Rankings #####################
high_val = []
for i in np.unravel_index([high_similarity_index],(24,24)):
    high_val.append(i)

index_a,index_b = high_val[0], high_val[1]
print(f'The Most Similar Books are: "{doc_names[int(index_a)]}", "{doc_names[int(index_b)]}"')
print(f'The Most Similar Books Rankings Are: {index_a}, {index_b}\n')


low_val = []
for i in np.unravel_index([low_similarity_index],(24,24)):
    low_val.append(i)

index_a,index_b = low_val[0], low_val[1]
print(f'The Most Dissimilar Books are: "{doc_names[int(index_a)]}", "{doc_names[int(index_b)]}"')
print(f'The Most Dissimilar Books Rankings Are: {index_a}, {index_b}')

NAN Indices: [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575]

Sorted Matrix Count: 576
Subbed Sorted Matrix Count: 552
Subbed Count Validation: 24

Highest Similarity Values: 0.37727845918634023
Lowest Similarity Values: 0.0

The Most Similar Books are: "The Poisonwood Bible: A Novel", "The Corrections: A Novel"
The Most Similar Books Rankings Are: [12], [9]

The Most Dissimilar Books are: "The Diary of a Young Girl", "Bel Canto (Harper Perennial Deluxe Editions)"
The Most Dissimilar Books Rankings Are: [11], [23]


## **Part 2**

### Search Engine Result Similarity Comparison for Top Search Result and 20th Search Result

Replicating the process used for our original book list, we chose one book at random, **'A Heartbreaking Work of Staggering Genius'**, and compared the similarity of that title to the first web search result capsule and to the 20th.

We found that the top search result is more similar to the original book title (0.619) than the 20th result (0.529). Measured relative to the top result's score, the 20th result's similarity is about 14% lower.
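
The percentage quoted above depends on which score is used as the baseline. The short sketch below reuses the two similarity scores printed in the output of this section; the variable names are illustrative.

```python
# similarity scores copied from the printed matrices in this section
top_score = 0.61857801
twentieth_score = 0.5294095

gap = top_score - twentieth_score

# baseline = top result: how much lower the 20th result scores
drop_relative_to_top = gap / top_score

# baseline = 20th result: how much higher the top result scores
gain_relative_to_20th = gap / twentieth_score

print(f"Drop relative to top result: {drop_relative_to_top * 100:.2f}%")
print(f"Gain relative to 20th result: {gain_relative_to_20th * 100:.2f}%")
```

The notebook's calculation divides by the top result's score, so its figure describes how far the 20th result falls below the top one. Dividing by the 20th result's score instead gives a larger figure, roughly 16.8%, for the same gap.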

In [8]:
search_result_1 = ['A Heartbreaking Work of Staggering Genius',
                """A Heartbreaking Work of Staggering Genius: Eggers, Dave ...https://www.amazon.com › Heartbreaking-Work-Stagg...
A book that redefines both family and narrative for the twenty-first century. A Heartbreaking Work of Staggering Genius is the moving memoir of a college senior ..."""]

In [9]:
search_result_20 = ['A Heartbreaking Work of Staggering Genius',
                """A Heartbreaking Work Of Staggering Genius Summary ...https://www.supersummary.com › summary
A Heartbreaking Work of Staggering Genius, a memoir by Dave Eggers (2000), was an immediate success both critically and commercially. · Dave's parents die​ ..."""]

In [10]:
#### For Top Web Result
tfidf_matrix_1 = TfidfVectorizer(strip_accents='ascii', lowercase=True).fit_transform(search_result_1)
linear_kernel_similarity_matrix_1 = linear_kernel(tfidf_matrix_1, tfidf_matrix_1)
print(f'Top Search Result Similarity Matrix: \n{linear_kernel_similarity_matrix_1}\n')

#### For 20th  Web Result
tfidf_matrix_2 = TfidfVectorizer(strip_accents='ascii', lowercase=True).fit_transform(search_result_20)
linear_kernel_similarity_matrix_2 = linear_kernel(tfidf_matrix_2, tfidf_matrix_2)
print(f'20th Search Result Similarity Matrix: \n{linear_kernel_similarity_matrix_2}\n')

#### Finding % Difference (drop from the top result to the 20th, relative to the top score)
diff = (linear_kernel_similarity_matrix_1[0][1] - linear_kernel_similarity_matrix_2[0][1])/linear_kernel_similarity_matrix_1[0][1]
print(f'% Difference Between Top and 20th: {round(diff*100,2)}%')


Top Search Result Similarity Matrix: 
[[1.         0.61857801]
 [0.61857801 1.        ]]

20th Search Result Similarity Matrix: 
[[1.        0.5294095]
 [0.5294095 1.       ]]

% Difference Between Top and 20th: 14.42%
