- Models
    - DistilBERT
        - https://huggingface.co/distilbert-base-uncased-distilled-squad
        - input limited to 512 tokens (question and context combined), not 512 words
        - suitable for page summaries
    - to look into
        - Longformer
            - https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9
    - open model directory
        - https://huggingface.co/
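
Because of the 512-token limit noted above, longer texts such as full articles have to be cut into pieces before they can be passed to the model. A minimal sketch of one way to do that, assuming a simple whitespace split is good enough: `split_into_windows`, `max_words` and `overlap` are hypothetical names, not part of any library. Words are only a rough proxy for tokens (a word can become several subword tokens), so `max_words` should stay well below 512.

```python
def split_into_windows(text, max_words=300, overlap=50):
    """Split text into overlapping word windows (hypothetical helper).

    Word counts only approximate token counts, so max_words is kept
    conservatively below the model's 512-token limit.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return windows
```

The overlap keeps an answer that straddles a window boundary intact in at least one window.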

# Setup

In [126]:
# for distilbert - answer questions
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# for choosing the correct article to answer question
from sentence_transformers import SentenceTransformer, util


# for getting wikipedia articles
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

# data
import pandas as pd

# Data input

In [190]:
# compile corpus
wikipedia_pages = '''
Artificial intelligence
Natural language processing
Deep learning
Supervised learning
Semi-supervised learning
Unsupervised learning
Statistical classification
Regression analysis
Federated learning
k-anonymity
Data anonymization
k-means clustering
DBSCAN
Dimensionality reduction
Silhouette (clustering)
Davies–Bouldin index
Multidimensional scaling
Cluster analysis
Principal component analysis
Isolation forest
Hierarchical clustering
Local outlier factor
Kaiser–Meyer–Olkin test
Bartlett's test


'''

In [187]:
df = pd.DataFrame({'title_input': wikipedia_pages.strip().splitlines()})
df['title'] = ''
df['text'] = ''

In [188]:
for idx, row in df.iterrows():
    page_py = wiki_wiki.page(row['title_input'])
    # assign with .loc: chained df.iloc[idx]['title'] = ... may write to a temporary copy
    df.loc[idx, 'title'] = page_py.title
    df.loc[idx, 'text'] = page_py.text

# Match article to input

In [191]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# join title and text with a separator and pass a plain list of strings;
# the model truncates long inputs, so only the start of each article is embedded
corpus = (df["title"] + ". " + df["text"]).tolist()
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
corpus_embeddings.shape

# cache the embeddings on disk and reload them
torch.save(corpus_embeddings, 'corpus_embeddings.pt')
corpus_embeddings_loaded = torch.load('corpus_embeddings.pt')

query = 'Statistics'
query_embedding = embedder.encode(query, convert_to_tensor=True)

top_k = 10

# semantic_search returns one list of hits per query; there is only one query here
hits = util.semantic_search(query_embedding, corpus_embeddings_loaded, top_k=top_k)
hits = hits[0]

for hit in hits:
    hit_id = hit['corpus_id']
    article_data = df.iloc[hit_id]
    title = article_data['title']
    print("-", title, hit['score'], hit_id)



RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
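
The matching step above ranks articles by how close their embedding is to the query embedding; `util.semantic_search` uses cosine similarity by default. A minimal pure-Python sketch of that ranking, using plain lists in place of tensors: `cosine_similarity` and `top_k_hits` are hypothetical helpers written for illustration, not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_hits(query_vec, corpus_vecs, top_k=3):
    """Return the top_k corpus entries, best match first.

    Each hit mirrors the shape used in the loop above:
    a dict with 'corpus_id' and 'score'.
    """
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]
```

For example, `top_k_hits([1, 0], [[1, 0], [0, 1], [1, 1]], top_k=2)` ranks the first vector highest, followed by the third.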

# Ask Questions

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

In [129]:
def distilbert_ask(question, text):
    # truncate only the context so that question + context fit the 512-token limit
    inputs = tokenizer(question, text, return_tensors="pt",
                       truncation="only_second", max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    # most likely start and end token positions of the answer
    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens)
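
Taking the two argmax values independently can produce an end index that lies before the start index, in which case the slice is empty and the answer comes back as an empty string. A minimal sketch of a common alternative, which picks the start/end pair with the highest combined score under the constraint `start <= end`: `best_span` and `max_answer_len` are hypothetical names, and the logits are plain Python lists here rather than tensors.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end].

    Only spans with start <= end and at most max_answer_len tokens
    are considered (hypothetical helper, not part of transformers).
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best
```

With tensors, `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` would give lists in the expected form.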


## Demo

In [148]:
page_title = input('wikipedia page title:')
page_py = wiki_wiki.page(page_title)
if not page_py.exists():
    print('page does not exist')
else:
    print('\npage title:', page_py.title)
    print('\nsummary:\n')
    print(page_py.summary)


wikipedia page title: kmeans



page title: K-means clustering

summary:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via

In [139]:
page_py.categories

{'Category:Articles with short description': Category:Articles with short description (id: ??, ns: 14),
 'Category:CS1 French-language sources (fr)': Category:CS1 French-language sources (fr) (id: ??, ns: 14),
 'Category:CS1 errors: missing periodical': Category:CS1 errors: missing periodical (id: ??, ns: 14),
 'Category:Cluster analysis algorithms': Category:Cluster analysis algorithms (id: ??, ns: 14),
 'Category:Short description is different from Wikidata': Category:Short description is different from Wikidata (id: ??, ns: 14)}

In [138]:
question = input('question:\n')
distilbert_ask(question, page_py.summary)

question:
 which metric?


'squared euclidean distances'

In [147]:
# search by category
c = wiki_wiki.page('Category:Cluster analysis algorithms')
c.categorymembers

{'Affinity propagation': Affinity propagation (id: ??, ns: 0),
 'Automatic clustering algorithms': Automatic clustering algorithms (id: ??, ns: 0),
 'BFR algorithm': BFR algorithm (id: ??, ns: 0),
 'BIRCH': BIRCH (id: ??, ns: 0),
 'Canopy clustering algorithm': Canopy clustering algorithm (id: ??, ns: 0),
 'Chinese whispers (clustering method)': Chinese whispers (clustering method) (id: ??, ns: 0),
 'Cluster-weighted modeling': Cluster-weighted modeling (id: ??, ns: 0),
 'Cobweb (clustering)': Cobweb (clustering) (id: ??, ns: 0),
 'Complete-linkage clustering': Complete-linkage clustering (id: ??, ns: 0),
 'Constrained clustering': Constrained clustering (id: ??, ns: 0),
 'CURE algorithm': CURE algorithm (id: ??, ns: 0),
 'Data stream clustering': Data stream clustering (id: ??, ns: 0),
 'DBSCAN': DBSCAN (id: ??, ns: 0),
 'Expectation–maximization algorithm': Expectation–maximization algorithm (id: ??, ns: 0),
 'FLAME clustering': FLAME clustering (id: ??, ns: 0),
 'Fuzzy clustering': 