# Filtering relevant pages using cosine similarity between page title and query string

This demo shows how well Sentence-BERT embeddings, combined with pairwise cosine similarity, identify page titles that are relevant to a query string. It is only an illustration; Common Crawl data is not used.

In [15]:
from sentence_transformers import SentenceTransformer

In [16]:
import requests
import nltk
from bs4 import BeautifulSoup

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

## Extract title of a web page

In [18]:
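def get_title(url):
    """Return the text of the page's <title>, or '' if the page has none."""
    response = requests.get(url, timeout=10)  # avoid hanging on slow hosts
    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title is None when the page has no <title> tag
    return soup.title.get_text(' ', strip=True) if soup.title else ''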

## We use a pre-trained RoBERTa model to embed our titles (sentences)

RoBERTa is used here because, according to the performance table published by the authors of the sentence-transformers library (https://github.com/UKPLab/sentence-transformers#performance), it has the highest Semantic Textual Similarity scores among the models compared there.

**Results may vary if a different sentence embedder is used.**

In [23]:
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')


## Some Web pages (both related and unrelated to query)

In [24]:
docs = [get_title('https://www.who.int/news/item/13-10-2020-impact-of-covid-19-on-people\'s-livelihoods-their-health-and-our-food-systems'),
        get_title('https://en.wikipedia.org/wiki/Economic_impact_of_the_COVID-19_pandemic'),
        get_title('https://en.wikipedia.org/wiki/Egg'),
        get_title('https://www.youtube.com/watch?v=0cGLrSpaf4o'),
        'Economic impact of covid pandamic',
        'Adverse effects of covid-19 on global businesses',
        'The US Economy and How Covid-19 has affected it',
        'Impact of the drama',
        'Economic impact of Covid-19' # Query term
       ]

Below are the titles of the web pages specified above

In [25]:
docs

["Impact of COVID-19 on people's livelihoods, their health and our food systems",
 'Economic impact of the COVID-19 pandemic - Wikipedia',
 'Egg - Wikipedia',
 'Coronavirus outbreak: The impact COVID-19 is having on the global economy - YouTube',
 'COVID-19 Notice',
 'Economic impact of covid pandamic',
 'Adverse effects of covid-19 on global businesses',
 'The US Economy and How Covid-19 has affected it',
 'Impact of the drama',
 'Economic impact of Covid-19']

## Cosine Similarity

Pairwise cosine similarity is calculated with sklearn's cosine_similarity function. Given the embeddings of n sentences, it returns an n x n matrix whose entry (i, j) is the similarity between documents i and j; nine sentences, for example, yield a 9x9 matrix.
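For reference, the cosine similarity of two embedding vectors is defined as follows. The symbols $u$ and $v$ are chosen for this note and do not appear in the notebook:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

The value lies between -1 and 1, and any non-zero vector compared with itself scores 1, so the diagonal of the pairwise matrix is all ones.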

**Since we only care about the similarity of each document embedding to the query embedding (the last item in the docs list), we take the last column of the matrix, which is identical to the last row because the matrix is symmetric.**

In [26]:
cosine_similarity(model.encode(docs))[:, -1]

array([0.5266193 , 0.75996774, 0.32760704, 0.54800946, 0.7709876 ,
       0.7430022 , 0.6930926 , 0.7830977 , 0.43976095, 1.        ],
      dtype=float32)
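To make the last-column selection concrete, here is a minimal pure-Python sketch of the same computation on toy data. The 2-D vectors and the helper names `cosine` and `pairwise_cosine` are made up for illustration; they stand in for the model embeddings and sklearn's function and are not part of the original notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(vectors):
    """n x n matrix whose entry [i][j] is cosine(vectors[i], vectors[j])."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy 2-D "embeddings" (illustrative only); the last one plays the query.
toy_vectors = [
    [1.0, 0.0],  # same direction as the query
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 1.0],  # 45 degrees away from the query
    [2.0, 0.0],  # the query itself
]

matrix = pairwise_cosine(toy_vectors)

# Last column: similarity of every vector to the query, like [:, -1] on a NumPy array.
query_scores = [row[-1] for row in matrix]
print([round(s, 3) for s in query_scores])  # [1.0, 0.0, 0.707, 1.0]
```

Note that vector length does not matter: `[1.0, 0.0]` and `[2.0, 0.0]` point the same way and therefore score 1.0.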

### Discussing results:

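Reading the scores in the same order as the titles printed above, the results match expectations. The two unrelated titles score lowest ('Egg - Wikipedia' at about 0.33 and 'Impact of the drama' at about 0.44), the Wikipedia article on the economic impact of the pandemic scores about 0.76, and the query compared with itself scores 1.0. The broader WHO and YouTube titles fall in between, at about 0.53 and 0.55.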
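One way to turn these scores into an actual filter is to keep only the titles whose similarity to the query meets a cutoff. The sketch below assumes the scores line up positionally with the titles printed earlier; the cutoff of 0.6 and the helper `filter_relevant` are hypothetical choices for illustration, not part of the original demo.

```python
# Titles and scores copied from the notebook outputs, in the same order.
titles = [
    "Impact of COVID-19 on people's livelihoods, their health and our food systems",
    "Economic impact of the COVID-19 pandemic - Wikipedia",
    "Egg - Wikipedia",
    "Coronavirus outbreak: The impact COVID-19 is having on the global economy - YouTube",
    "COVID-19 Notice",
    "Economic impact of covid pandamic",
    "Adverse effects of covid-19 on global businesses",
    "The US Economy and How Covid-19 has affected it",
    "Impact of the drama",
    "Economic impact of Covid-19",  # query term
]
scores = [0.5266193, 0.75996774, 0.32760704, 0.54800946, 0.7709876,
          0.7430022, 0.6930926, 0.7830977, 0.43976095, 1.0]

THRESHOLD = 0.6  # hypothetical cutoff; tune it for the embedder in use

def filter_relevant(titles, scores, threshold):
    """Return (title, score) pairs at or above the threshold, best first."""
    # Skip the last entry, which is the query compared with itself.
    candidates = zip(titles[:-1], scores[:-1])
    kept = [(title, score) for title, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

relevant = filter_relevant(titles, scores, THRESHOLD)
for title, score in relevant:
    print(f"{score:.2f}  {title}")
```

A fixed cutoff is the simplest choice; keeping the top k titles instead would avoid having to pick a threshold that depends on the embedding model.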