# Extracting Relevant Data from Wikipedia
This notebook extracts the full text of a set of relevant Wikipedia articles given a sentence or topic tag, i.e. some hook for relevant information. The output feeds the first layer of our knowledge retrieval pipeline, which selects the most relevant document; the second layer is then responsible for finding the most similar sentence.

"Document" here can refer to any granularity. At the moment, I am leaning towards the paragraph level.
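
As a rough illustration of paragraph-level granularity, the sketch below splits raw article text on blank lines and drops very short fragments. The helper name `split_into_paragraphs` and the `min_chars` threshold are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import re

def split_into_paragraphs(text, min_chars=40):
    """Split text on blank lines and keep paragraphs of at least min_chars characters.

    Hypothetical helper: the 40-character threshold is an arbitrary choice.
    """
    # One or more blank lines (possibly containing whitespace) separate paragraphs.
    chunks = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace so each paragraph is a single line.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if len(p) >= min_chars]
```

Each returned paragraph could then be treated as one "document" by the first retrieval layer.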

TO BE DECIDED: Our approach for the knowledge retrieval processing layers. Which layer should we allocate more resources to?

First Layer: More likely to find the paragraph that is in fact most relevant, but may compromise on selecting the most relevant sentence from that paragraph.

Second Layer: The opposite trade-off to the first layer, i.e. better sentence selection, but a higher chance of starting from a less relevant paragraph.

My take: I suspect a crude method like TF-IDF may get us close enough to the accuracy of NN-based models for the first layer, after which we can use a more sophisticated model based on sentence embeddings for sentence similarity.

## Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import wikipedia

## Simulate extracting pos from message

In [None]:
pos = ['cricket', 'Pakistan', 'vs', 'Australia']
search_str = ' '.join(pos)
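
The list above is hard-coded. One crude way to obtain a similar list from a message, assuming plain stopword filtering is an acceptable stand-in for a real part-of-speech tagger, is sketched below. The `STOPWORDS` set and the `extract_keywords` helper are hypothetical and exist only for this illustration.

```python
import re

# A deliberately tiny, hypothetical stopword list; a real one would be much larger.
STOPWORDS = {"i", "ve", "been", "good", "just", "some", "have", "you", "watching", "a", "the"}

def extract_keywords(message):
    """Return the non-stopword tokens of message, in order, without duplicates."""
    tokens = re.findall(r"[A-Za-z]+", message)
    keywords = []
    seen = set()
    for token in tokens:
        key = token.lower()
        if key in STOPWORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(token)
    return keywords

message = "I've been good! Just watching some cricket. Have you been watching Pakistan vs Australia?"
example_search_str = " ".join(extract_keywords(message))
```

The result depends entirely on the stopword list, so this is only a placeholder for whatever extraction step the pipeline ends up using.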

In [None]:

# Import the WordNet lemmatizer (requires the NLTK 'wordnet' corpus to be downloaded)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a "pos" argument, lemmatize() treats the word as a noun
print("Australian :", lemmatizer.lemmatize("Australian"))
print("American :", lemmatizer.lemmatize("American"))

# "a" denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

## Test: Use the Python wikipedia library to search for relevant articles

In [6]:
wikipedia.search("game Barcelona match yesterday day football", results = 4)

['2010–11 FC Barcelona season',
 'Marc Overmars',
 'Royston Drenthe',
 'José Mourinho']

## Use BeautifulSoup to obtain paragraph data from Wikipedia Articles

In [None]:
from urllib.parse import quote

def search_and_retrieve_from_wikipedia(search_item, num_results):
    print(f"Fetching data for {search_item}")
    articles = wikipedia.search(search_item, results=num_results)
    print(f"Using the following relevant articles: {articles}")

    documents = []
    for article in articles:
        # Percent-encode the title so characters such as '?' or '&' do not break the URL
        title = quote(article.replace(' ', '_'))
        page = requests.get(f"https://en.wikipedia.org/wiki/{title}")

        # Parse the page HTML
        soup = BeautifulSoup(page.content, 'html.parser')

        # Find all occurrences of the paragraph tag <p> in the HTML
        p_tags = soup.find_all('p')

        # Join the text of every paragraph into a single document
        document = " ".join(p_tag.get_text() for p_tag in p_tags)

        # Print a preview of the article (some pages may have no <p> tags)
        preview = p_tags[0].get_text() if p_tags else ""
        print(f"{article} article preview: {preview}...")

        # Add the compiled document to the list, in the same order as the articles
        documents.append(document)
    return articles, documents
        
articles, documents = search_and_retrieve_from_wikipedia(search_str, 4)
turn = "I've been good! Just watching some cricket. Have you been watching Pakistan vs Australia?"

## Document Similarity
Plugging in Document Similarity code here since it fits in with data extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import vstack

def process_tfidf_similarity(base_document, documents):
    # Variant A: fit the vocabulary on the candidate documents only, then
    # project the base document into that vocabulary.
    vectorizer = TfidfVectorizer()
    embeddings = vectorizer.fit_transform(documents)
    embeddings = vstack((vectorizer.transform([base_document]), embeddings))

    # Variant B: fit on the base document and the candidates together, so that
    # all vectors share one vocabulary and one set of IDF weights.
    d = [base_document]
    d.extend(documents)
    vectorizer = TfidfVectorizer()
    eds = vectorizer.fit_transform(d)
    print(type(embeddings))
    print(embeddings.shape, len(d))

    # Row 0 is the base document; the remaining rows are the candidates.
    cosine_similarities = cosine_similarity(embeddings[0:1], embeddings[1:]).flatten()
    print(cosine_similarities)

    cosine_similarities = cosine_similarity(eds[0:1], eds[1:]).flatten()
    print(cosine_similarities)
    # Only the variant B similarities are returned.
    return cosine_similarities
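
To make the computation concrete, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity, fitted on the base document and the candidates together. It is an illustration rather than a drop-in replacement: the smoothed IDF formula and the two-character token rule are meant to roughly mirror scikit-learn's `TfidfVectorizer` defaults, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_documents`) are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(texts):
    """Return one {term: weight} dict per text, L2-normalised."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    # Document frequency: number of texts containing each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count.
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine similarity.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rank_documents(query, documents):
    """Score each document against the query; higher means more similar."""
    vectors = tfidf_vectors([query] + documents)
    return [cosine(vectors[0], v) for v in vectors[1:]]
```

A document that shares no tokens with the query scores exactly zero, which is one reason TF-IDF works as a cheap first filter but cannot capture paraphrases.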

In [None]:
"""
# Check overlap between article title and the pos list, if noticeable, then pick that one without checking similarity
best_articles = []
max_overlap = -1
for article in articles:
    title_overlap_count = 0
    w_list = article.split(' ')
    for w in w_list:
        if w in pos:
            title_overlap_count += 1
    
    print(f"overlap for '{article}' is: {title_overlap_count}")
    if title_overlap_count > max_overlap:
        best_articles = [article]
        max_overlap = title_overlap_count
    elif title_overlap_count == max_overlap:
        best_articles.append(article)
"""
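#if len(best_articles) > 1:
# best_articles only exists inside the disabled block above, so keep the
# `documents` list returned by search_and_retrieve_from_wikipedia.
#documents = best_articles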
c_sim = process_tfidf_similarity(turn, documents)
selected_article_id = c_sim.argmax()
selected_article = articles[selected_article_id]
print(f"'{articles[selected_article_id]}' has been selected as the most relevant article")
selected_document = documents[selected_article_id]
"""
elif len(best_articles) == 1:
    selected_document = best_articles[0]
else:
    selected_document = None
"""

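The disabled title-overlap check in the cell above could be packaged as a small function. The sketch below is one possible version; unlike the original loop it compares words case-insensitively, which is an assumption on my part. The helper names and the example titles are hypothetical, not actual search results.

```python
def title_overlap(title, keywords):
    """Count how many words of title appear in keywords, ignoring case."""
    keyword_set = {k.lower() for k in keywords}
    return sum(1 for word in title.split() if word.lower() in keyword_set)

def best_titles_by_overlap(titles, keywords):
    """Return every title that ties for the highest overlap count."""
    if not titles:
        return []
    scores = [title_overlap(t, keywords) for t in titles]
    best = max(scores)
    return [t for t, s in zip(titles, scores) if s == best]

# Made-up titles for illustration only.
example_titles = [
    "Australia national cricket team",
    "Pakistan national cricket team",
    "Cricket",
]
example_keywords = ["cricket", "Pakistan", "vs", "Australia"]
tied = best_titles_by_overlap(example_titles, example_keywords)
```

When several titles tie, as they do here, TF-IDF similarity on the article text can be used to break the tie.
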
## Sentence Similarity using MPNet
Worth considering: if we give topic modelling the burden of narrowing down the knowledge source, we can go straight to the Wikipedia article we want.
However, for practical applications, being able to narrow down documents is still useful, so we might want to keep TF-IDF in there.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize

n_gram_range = (5, 5)
stop_words = "english"

# Extract candidate words/phrases
#count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
#candidates = count.get_feature_names()
#print(selected_document)
message = turn
selected_doc_tok_orig = sent_tokenize(selected_document)
selected_doc_tok = [selected_article + " " + s for s in selected_doc_tok_orig]

model = SentenceTransformer('../../models/all-mpnet-base-v2', device='cuda')
doc_embedding = model.encode([message])
candidate_embeddings = model.encode(selected_doc_tok)

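top_n = 5
# Despite the name, `distances` holds cosine similarities: higher means more similar
distances = cosine_similarity(doc_embedding, candidate_embeddings)

print(distances.shape)
# argsort is ascending, so the last top_n indices are the most similar sentences
keywords = [selected_doc_tok[index] for index in distances.argsort()[0][-top_n:]]

idx = distances.argsort()[0][-top_n:]
for index in idx:
    print(f"dist: {distances[0][index]}, sent: {selected_doc_tok_orig[index]}, index: {index}")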

#print(keywords)
print(distances)
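
The selection step can be illustrated without a model. The sketch below ranks candidates by cosine similarity and returns the best matches first (the cell above lists them in ascending order). The toy three-dimensional vectors stand in for real sentence embeddings, and the helper names are hypothetical.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_similar(query_vec, candidate_vecs, n):
    """Return (index, similarity) pairs for the n most similar candidates, best first."""
    scored = [(i, cosine_similarity_1d(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = top_n_similar(query, candidates, n=2)
```

Returning the index alongside the score makes it easy to map back to the original, unprefixed sentence list.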

In [None]:
import wikipedia
import nltk
page = wikipedia.page('Lewis Hamilton')
print(len(nltk.tokenize.word_tokenize(page.content)))