# Filtering relevant pages using cosine similarity between page content and reference articles

This demo shows how document embeddings combined with pairwise cosine similarity can separate relevant pages from irrelevant ones. The code below uses TF-IDF vectors in place of **sentence BERT's document embeddings**. It is an illustration only; no Common Crawl data is used.

In [1]:
from sentence_transformers import SentenceTransformer

In [2]:
import requests
import nltk
import nlp
from bs4 import BeautifulSoup
import trafilatura
import numpy as np

In [3]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string
import re

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vignesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Extract and clean the text of a web page

In [6]:
porter = PorterStemmer()
regex = re.compile('[%s]' % re.escape(string.punctuation))
sw = set(stopwords.words("english"))  # set gives fast membership tests

In [7]:
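def clean_text(text):
    """Lowercase, strip punctuation, drop short words and stopwords, then stem."""
    text = text.lower()
    text = regex.sub('', text)  # remove all punctuation characters
    # keep words of at least 4 characters that are not English stopwords
    words = [word for word in text.split() if len(word) >= 4 and word not in sw]
    # reduce each remaining word to its Porter stem
    return ' '.join(porter.stem(word) for word in words)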
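
To see what the first cleaning steps do to a sentence, here is a minimal sketch that reproduces only the lowercasing, punctuation removal, and length filter. It deliberately leaves out stopword removal and stemming so that it runs without NLTK. The helper name `strip_and_filter` and the sample sentence are made up for illustration and are not part of the notebook.

```python
import re
import string

# Same punctuation pattern as the one compiled in the notebook.
punctuation = re.compile('[%s]' % re.escape(string.punctuation))

def strip_and_filter(text, min_len=4):
    """Hypothetical helper: lowercase, drop punctuation, keep words of min_len or more characters."""
    text = punctuation.sub('', text.lower())
    return ' '.join(word for word in text.split() if len(word) >= min_len)

sample = "The WHO says COVID-19 hit the U.S. economy hard!"
print(strip_and_filter(sample))  # says covid19 economy hard
```

Note that the length filter also drops short but meaningful tokens such as "WHO" and "U.S.", and that removing punctuation turns "COVID-19" into "covid19".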

In [8]:
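def get_page_content(url):
    """Download a page and return its cleaned main text ('' if nothing could be extracted)."""
    downloaded = trafilatura.fetch_url(url)
    # both the download and the extraction can return None
    extracted = trafilatura.extract(downloaded) if downloaded else None
    return clean_text(extracted) if extracted else ''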

## Reference (gold standard) article

Our approach is to embed one or more gold-standard reference articles and measure how similar the content of other web pages is to them; the reference articles stand in for the query we care about. Instead of sentence BERT embeddings, this demo uses TF-IDF document vectors. Either approach should yield similar results.

### Embedding the reference documents with TfidfVectorizer

In [9]:
ref_docs = [
    get_page_content('https://en.wikipedia.org/wiki/Economic_impact_of_the_COVID-19_pandemic'),
    get_page_content('https://www.who.int/news/item/13-10-2020-impact-of-covid-19-on-people\'s-livelihoods-their-health-and-our-food-systems'),
    get_page_content('https://www.brookings.edu/research/ten-facts-about-covid-19-and-the-u-s-economy/'),
    get_page_content('https://www.mckinsey.com/business-functions/risk/our-insights/covid-19-implications-for-business'),
    get_page_content('https://carsey.unh.edu/COVID-19-Economic-Impact-By-State'),
    get_page_content('https://www.frontiersin.org/articles/10.3389/fpubh.2020.00241/full'),
    get_page_content('https://www.pewsocialtrends.org/2020/09/24/economic-fallout-from-covid-19-continues-to-hit-lower-income-americans-the-hardest/'),
    get_page_content('https://www.reuters.com/article/us-usa-economy-poll/u-s-economy-to-slow-in-first-quarter-but-reach-pre-covid-19-levels-in-a-year-reuters-poll-idUSKBN28K00A'),
    get_page_content('https://www.mckinsey.com/business-functions/strategy-and-corporate-finance/our-insights/the-coronavirus-effect-on-global-economic-sentiment')
]

vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(ref_docs)

## Candidate web pages (both related and unrelated to the query)

In [10]:
docs = [get_page_content('https://www.who.int/news/item/13-10-2020-impact-of-covid-19-on-people\'s-livelihoods-their-health-and-our-food-systems'),
        get_page_content('https://en.wikipedia.org/wiki/Egg'),
        get_page_content('https://www.youtube.com/watch?v=0cGLrSpaf4o')
]

## Cosine Similarity

**Each candidate page is projected into the TF-IDF space fitted on the reference documents, and its score is the average of its cosine similarities to all reference documents.**
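
For intuition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following pure-Python sketch computes it for toy vectors and averages it over a set of reference vectors, which mirrors the scoring done in the cell below. The function names and the toy vectors are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # choice made in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

def mean_similarity(doc_vec, ref_vecs):
    """Average cosine similarity of one document vector to every reference vector."""
    return sum(cosine(doc_vec, ref) for ref in ref_vecs) / len(ref_vecs)

# Toy reference vectors: one identical to the document, one orthogonal to it.
refs = [[1.0, 0.0], [0.0, 1.0]]
print(mean_similarity([1.0, 0.0], refs))  # 0.5
```
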

In [11]:
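for doc in docs:
    # project the page into the vocabulary fitted on the reference documents
    doc_vector = vectorizer.transform([doc])
    # average cosine similarity to all reference documents
    avg_similarity = np.mean(cosine_similarity(doc_vector, docs_tfidf))
    print(avg_similarity)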

0.21832993389217661
0.10310503939757354
0.08803144907161992
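
To turn these scores into an actual filter, one option is to keep only pages whose average similarity reaches a cutoff. The sketch below applies that idea to the three scores printed above. The cutoff of 0.15 and the helper name `filter_relevant` are assumptions chosen for illustration; the notebook does not specify a threshold, and a real one would need tuning on labeled pages.

```python
def filter_relevant(scores, threshold=0.15):
    """Hypothetical helper: return the URLs whose average similarity is at least threshold."""
    return [url for url, score in scores.items() if score >= threshold]

# Average similarities printed by the notebook, keyed by the page they belong to.
scores = {
    "https://www.who.int/news/item/13-10-2020-impact-of-covid-19-on-people's-livelihoods-their-health-and-our-food-systems": 0.21832993389217661,
    "https://en.wikipedia.org/wiki/Egg": 0.10310503939757354,
    "https://www.youtube.com/watch?v=0cGLrSpaf4o": 0.08803144907161992,
}

relevant = filter_relevant(scores)
print(relevant)
```

With this cutoff only the WHO page is kept.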


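### Discussion of results

The scores match expectations: the WHO article on the impact of COVID-19 scores highest (about 0.22), while the Wikipedia page on eggs (about 0.10) and the YouTube page (about 0.09) score far lower. Note that the WHO article is also one of the reference documents, so its average includes a comparison with its own text, which likely inflates its score.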